Preface



This volume contains the papers that were accepted for presentation at the
International Conference on TeX, XML, and Digital Typography, jointly held with
the 25th Annual Meeting of the TeX Users Group in Xanthi, Greece in the summer
of 2004. The term “Digital Typography” refers to the preparation of printed
matter using only electronic computers and electronic printing devices, such as
laser-jet printers. The document preparation process involves mainly the use of
a digital typesetting system as well as data representation technologies. TeX
and its offspring are beyond doubt the most successful current digital
typesetters, while XML is the standard for text-based data representation for
both business and scientific activities.

All papers appearing in this volume were fully refereed by the members of the
program committee. The papers were carefully selected to reflect the research
work that is being done in the field of digital typography using TeX and/or its
offspring.

The problems for which comprehensive solutions have been proposed include 
proper multilingual document preparation and XML document processing and 
generation. The proposed solutions deal not simply with typesetting issues, but 
also related issues in document preparation, such as the manipulation of com- 
plex bibliographic databases, and automatic conversion of text expressed in one 
grammatical system to a more recent one (as for the Greek language, converting 
between monotonic Greek and polytonic Greek).

The conference is being graciously hosted by the Democritus University of
Thrace in Xanthi and by the Greek TeX Friends. We wish to thank Basil K.
Papadopoulos and Georgia Papadopoulou of the Democritus University for their
generous help and support in the preparation of the conference. Special thanks
also go to Stratos Doumanis, Georgios Maridakis, and Dimitrios Filippou for
their invaluable help. Last but not least we thank the Municipality of Xanthi
for their help and support.
Digital Typography in the New Millennium
Flexible Documents by a Flexible Engine
GR-157 84 Athens, Greece
loverdos@di.uoa.gr
366, 28th October Str.
http://obelix.ee.duth.gr/~apostolo



Abstract. The TeX family of electronic typesetters contains the primary
typesetting tools for the preparation of demanding documents, and has been
in use for many years. However, our era is characterized, among other
things, by Unicode, XML and the introduction of interactive documents.
In addition, the Open Source movement, which is breaking new ground
in the areas of project support and development, enables masses of
programmers to work simultaneously. As a direct consequence, it is
reasonable to demand the incorporation of certain facilities into a highly
modular implementation of a TeX-like system. Facilities such as the ability
to extend the engine using common scripting languages (e.g., Perl, Python,
Ruby, etc.) will help in reaching a greater level of overall architectural
modularity. Obviously, in order to achieve such a goal, it is mandatory to
attract a greater programming audience and leverage the Open Source
programming community. We argue that the successful TeX successor
should be built around a microkernel/exokernel architecture. Thus,
services such as client-side scripting, font selection and use, output
routines and the design and implementation of formats can be programmed as
extension modules. In order to leverage the huge amount of existing code,
and keep document source compatibility, the existing programming
interface is demonstrated to be just another service/module.



1 Introduction 

The first steps towards computer typesetting took place in the 1950s, but it was
not until Donald E. Knuth introduced TeX in 1978 [16] that true quality was
brought to software-based typesetting. The history of TeX is well-known and
the interested reader is referred to [16] for more details.

Today, the original TeX is a closed project in the sense that its creator has
decided to freeze its development. As a direct consequence no other programs
are allowed to be called TeX. In addition, the freely available source code of
the system was a major step on the road towards the formation of the Open
Source movement, which, in turn, borrowed ideas and practices from the Unix
world. Furthermore, the development of TeX and its companion system,
METAFONT, made obvious the need for properly documented programs. This, in
turn, initiated Knuth’s creation of the literate programming methodology of
program development. This methodology advances the idea that the program code
and documentation should be intermixed and developed simultaneously.

The source code of TeX and METAFONT being freely available has had enormous
consequences. Anyone can not only inspect the source code, but also
experiment freely with it. Combined with TeX’s (primitive, we should note, but
quite effective for the time) ability to extend itself, this led to such success
stories as LaTeX and its enormous supporting codebase, in the form of packages.
As a direct consequence of the fact that the source code is frozen, stability
was brought forth. Note that this was exactly the intention Knuth had when
developing his systems. A commonly referred-to core, unchanged in the passing
of time and almost free of bugs, offered a “secure” environment to produce with
and even experiment with.

However, in an ever-changing world, especially in the fast-paced field of
computer science, almost anything must eventually be surpassed. And it is the
emerging needs of each era that dictate possible future directions. TeX has
undoubtedly served its purpose well. Its Turing-completeness has been a most
powerful asset/weapon in the battles for and of evolution. Yet, the desired
abstraction level, needed to cope with increasing complexity, has not been
reached. Unfortunately, with TeX being bound to a fixed core, it cannot be
reached.

Furthermore, the now widely accepted user-unfriendliness of TeX as a language
poses another obstacle to TeX’s evolution. It has created the myth of
those few, very special and quite extraordinary “creatures”^1 able to decrypt
and produce code fragments such as the following^2:

\def\s@vig{\EO@m=\EO@n
  \divide\EO@n by20 \relax
  \ifnum\EO@n>0\s@vig\fi
  \EO@k=\EO@n\relax
  \multiply\EO@k by-20\relax
  \advance\EO@m by \EO@k\relax
  \global\advance\EO@l by \@ne
  \expandafter\xdef\csname EO@d\@roman{\EO@l}\endcsname{%
    \ifnum\EO@m=0\noexpand\noexpand\EOzero
    \else\expandafter\noexpand
      \csname EO\@roman{\EO@m}\endcsname\fi}
  \expandafter\@rightappend
    \csname EO@d\@roman{\EO@l}\endcsname
    \t@\epi@lmecDigits}



^1 The second author may be regarded as one of Gandalf’s famuli, while the
first author is just a Hobbit, wishing to have been an Elf.

^2 Taken from the documentation of the epiolmec package by the second author.

Of course, to be fair, programmers in several languages (C and Perl among
others) are often accused of producing incomprehensible code, and the
well-known obfuscated code contests just prove it. On the other hand, with the
advent of quite sophisticated assemblers, today one can even write
well-structured assembly language, adhering even to “advanced”
techniques/paradigms, such as object-oriented programming. Naturally, this
should not lead to the conclusion that we should start writing in assembly
(again)! In our opinion, software complexity should be tackled with an emphasis
on abstraction that will eventually lead to increased productivity, as is shown
in the following figure:



[Figure: Complexity --requires--> Abstraction --increases--> Productivity]



TeX’s programming language is more or less an “assembly language” for
electronic typesetting. It is true that higher level constructs can be made -
macros and macro packages built on top of that. But the essence remains the
same. Although it is true that TeX is essentially bug free and its macro
expansion facility behaves the way it is specified (i.e., as defined in [9]),
it still remains a fact that it takes a non-specialist quite some time to fully
understand the macro expansion rules in spite of Knuth’s initial intentions
[12, page 6].

The fact that one should program in the language of one’s choice is just
another reason for moving away from a low-level language. And it is true that
we envision an environment where as many programmers as possible can - and,
most importantly, wish to - contribute. In the era of the Open Source
revolution, we would like to attract the Open Source community and not just a
few dedicated low-level developers. Open Source should also mean, in our
opinion, “open possibilities” to evolve the source. This is one of our major
motivations for reengineering the most successful typesetting engine.

Richard Palais, the founding chairman of TUG, pointed out back in 1992 [12,
page 7] that when developing TeX, Knuth

. . . had NSF grant support that not only provided him with the time and
equipment he needed, but also supported a team of devoted and brilliant graduate
students who did an enormous amount of work helping design and write the
large quantity of ancillary software needed to make the TeX system work . . .

and immediately after this, he poses the fundamental question: 

Where will the resources come from for what will have to be at least an equally 
massive effort? And will the provider of those resources be willing, at the end 
of the project, to put the fruits of all his effort in the Public Domain? 

The answer seems obvious now. The way has been paved by the GNU/Linux/BSD
revolutionary development model, as has been explained crystal clearly in
The Cathedral and the Bazaar [15].

This paper is an attempt to define a service-oriented architecture for a
future typesetting engine, which will be capable of modular evolution. We take a
layered approach of designing some core functionality and then define extensible
services on top of the core. The engine is not restricted to a specific
programming language either for its basic/bootstrapping implementation or, even
more important, for its future enhancement. At the same time, we are bound to
provide a 100% TeX-compatible environment, as the only means of supporting the
vast quantity of existing TeX-based documents. We intend to achieve such a
goal by leveraging the proposed architecture’s own flexibility. Specifically, a
TeX compatibility mode is to be supported and it should give complete
“trip-test” compliance. Later on, we shall see that this compatibility is
divided into two parts: source code compatibility and internal core
compatibility. Both are provided by pluggable modules.

Structure of the Paper. In the following sections we briefly review the most
important and influential approaches to extending or reengineering TeX,
including TeX’s inherent abilities to evolve. Then we discuss a few desired
characteristics for any next generation typesetting engine. We advance by
proposing an architecture to support these emerging needs. Finally, we conclude
by discussing further and future work.

2 A Better TeX?

2.1 TeX the Program

TeX supports a Turing-complete programming language. Simply, this means
that if it lacks a feature, it can be programmed. It contains only a few concepts
and belongs to the LISP family of languages. In particular, it is a list-based
macro-language with late binding [5, Sec. 3.3]:

Its data constructs are simpler than in Common Lisp: ‘token list’ is the only
first order type. Glue, boxes, numbers, etc., are engine concepts; instances of
them are described by token lists. Its lexical analysis is simpler than CL: One
cannot program it. One can only configure it. Its control constructs are simpler
than in CL: Only macros, no functions. And the macros are only simple ones,
one can’t compute in them.

Further analysis of TeX’s notions and inner workings such as category codes,
TeX’s mouth and stomach is beyond the scope of this paper and the interested
reader is referred to the classic [9] or the excellent [3].
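To make the phrase "list-based macro language with late binding" concrete, the following Python sketch models macros as named token lists that are looked up only at expansion time. It is a toy illustration under our own assumptions, not a description of how TeX is implemented: the function `expand` and the macro names `\greeting` and `\name` are hypothetical, and arguments, category codes and conditionals are ignored.

```python
def expand(tokens, macros, depth=0):
    """Fully expand a token list; macro bodies are looked up only when used."""
    if depth > 100:
        raise RecursionError("macro expansion too deep")
    out = []
    for tok in tokens:
        if tok in macros:
            # Late binding: the current definition is fetched at expansion time.
            out.extend(expand(macros[tok], macros, depth + 1))
        else:
            out.append(tok)
    return out


# Hypothetical macros, each defined as a plain token list.
macros = {
    r"\greeting": ["Hello,", r"\name"],
    r"\name": ["world"],
}

first = expand([r"\greeting"], macros)
macros[r"\name"] = ["TeX"]  # redefine \name after \greeting was defined
second = expand([r"\greeting"], macros)
```

Because `\greeting` stores the token `\name` rather than its expansion, redefining `\name` changes what `\greeting` produces the next time it is used.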

TeX the program is written in the WEB system of literate programming.
Thus, its source code is self-documented. The programs tangle and weave are used
to extract the Pascal code and the documentation, respectively, from the WEB
code. The documentation is of course specified in the TeX notation. Although
the TeX source is structured in a monolithic style, its architecture provides for
some kind of future evolution.

First, TeX can be “extended” by the construction of large collections of
macros that are simply called formats. Each format can be transformed to a
quickly loadable binary form, which can be thought of as a primitive form of the
module concept.

Also, by the prescient inclusion of the \special primitive command, TeX
provides the means to express things beyond its built-in “comprehension”. For
example, TeX knows absolutely nothing about PostScript graphics, yet by using
\special and with the appropriate driver program (e.g., dvips), PostScript
graphics can be easily incorporated into documents. Color is handled in the
same way. In all cases, all that TeX does is to expand the \special command
arguments and transfer the command to its normal output, that is, the DVI file
(a file format that contains only page description commands).

Last, but not least, there is the notion of change file [3, page 243]:

A change file is a list of changes to be made to the WEB file; a bit like a stream
editor script. These changes can comprise both adaptations of the WEB file to
the particular Pascal compiler that will be used and bug fixes to TeX. Thus the
TeX.web file needs never to be edited.

Thus, change files provide a form of incremental modification. This is similar to 
the patch mechanism of Unix. 
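The idea of a change file can be sketched in a few lines of Python. The sketch assumes the usual WEB convention that each change is written as a block of lines to match after `@x`, the replacement lines after `@y`, and a closing `@z`. The function names and the sample lines are our own and serve only as an illustration; the real tangle and weave programs apply stricter matching rules than this simplified version.

```python
def parse_changes(change_text):
    """Split a change file into (old_lines, new_lines) pairs."""
    changes, old, new, state = [], [], [], None
    for line in change_text.splitlines():
        tag = line[:2].lower()
        if tag == "@x":
            old, new, state = [], [], "old"
        elif tag == "@y" and state == "old":
            state = "new"
        elif tag == "@z" and state == "new":
            changes.append((old, new))
            state = None
        elif state == "old":
            old.append(line)
        elif state == "new":
            new.append(line)
        # Lines outside an @x...@z block are commentary and are ignored.
    return changes


def apply_changes(web_text, change_text):
    """Return a changed copy of web_text; the original text is never edited."""
    lines = web_text.splitlines()
    out, pos = [], 0
    for old, new in parse_changes(change_text):
        # Changes are applied in order, searching forward from the last match.
        for i in range(pos, len(lines) - len(old) + 1):
            if lines[i:i + len(old)] == old:
                out.extend(lines[pos:i])
                out.extend(new)
                pos = i + len(old)
                break
        else:
            raise ValueError(f"change block not found: {old!r}")
    out.extend(lines[pos:])
    return "\n".join(out)


# Illustrative input only; these lines merely imitate the style of a WEB file.
web = "@!mem_max=30000;\n@!buf_size=500;\n@!pool_size=32000;"
change = "Enlarge the input buffer.\n@x\n@!buf_size=500;\n@y\n@!buf_size=3000;\n@z\n"
patched = apply_changes(web, change)
```

As with a Unix patch, the master source stays untouched and every local adaptation lives in a separate, small file.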

Yet, no matter how foresighted these methods may be, twenty years after its
conception TeX has started to show its age. Today’s trends, and more
importantly the programming community’s continuing demand for even more flexible
techniques and systems, call for new modes of expressiveness.



2.2 The LaTeX Format

LaTeX [10], which was released around 1985, is the most widely known TeX
format. Nowadays, it seems that LaTeX is the de facto standard for the
communication and publication of scientific documents (i.e., documents that
contain a lot of mathematical notation). LaTeX “programs” have a Pascal-like
structure and the basic functionality is augmented with the incorporation of
independently developed collections of macro packages. In addition, classes are
used to define major document characteristics and are in essence document types,
such as book, article, etc. Thus, each LaTeX “program” is characterized by the
document class to which it belongs, by the packages it utilizes, and any new
macro commands it may provide.

The current version of LaTeX is called LaTeX 2ε. Work is in progress to produce
and widely distribute the next major version, LaTeX3 [11]. Among the several
enhancements that the new system will bring forth, are:

— Overall robustness 

— Extensibility, relating to the package interface 

— Better specification and inclusion of graphical material 

— Better layout specification and handling 

— Inclusion of requirements of hypertext systems 

The LaTeX3 core team expects that a major reimplementation of LaTeX is needed
in order to support the above goals.

The ConTeXt [13] format, developed by Hans Hagen, is monolithic when
compared to LaTeX. As a result, the lessons learned from its development are
not of great interest to our study.
2.3 NTS: The New Typesetting System

The NTS project [14] was established in 1992 as an attempt to extend TeX’s
typesetting capabilities and at the same time to propose a new underlying
programmatic model. Its originators recognised that TeX lacked user-friendliness
and as a consequence it attracted many fewer users than it could (or should).
Moreover, TeX (both as a name and a program) was frozen by Knuth, so any
enhancements had to be implemented in a completely new system.

NTS was the first attempt to recognize that TeX’s monolithic structure and
implementation in an obsolete language (i.e., the Pascal programming language)
are characteristics that could only impede its evolution. The techniques used to
implement TeX, particularly its “tight”, static and memory conservative data
structures have no (good) reason to exist today (or even when NTS was
conceived, in 1992), when we have had a paradigm shift to flexible programming
techniques.

After considering and evaluating several programming paradigms [19]
including functional, procedural and logic programming, the NTS project team
decided to proceed with a Java-based implementation. Java’s object-oriented
features and its network awareness were the main reasons for adopting Java, as
NTS was envisioned as a network-based program, able to download and combine
elements from the network.

Today, there is a Java codebase, which has deconstructed the several
functional pieces of TeX and reconstructed them in a more object-oriented way
with cleaner interfaces, a property that the original TeX source clearly lacks.
In spite of the promising nature of NTS, the directory listing at CTAN^3 shows
that the project has been inactive since 2001^4. It seems that the main focus is
now the development of ε-TeX, which is presented in the following section.

2.4 ε-TeX

ε-TeX [17] was released by the NTS team as soon as it was recognized that NTS
itself was very ambitious and that a more immediate and more easily conceivable
goal should be set. So, it was decided that the first step towards a new
typesetting system was to start with a reimplemented but 100% TeX compatible
program.

ε-TeX was released in 1996, after three years of development and testing. It
adds about thirty new primitives to the standard TeX core, including handling
of bidirectional text (right-to-left typesetting). It can operate in three
distinct modes:

1. “compatibility” mode, where it behaves exactly like standard TeX.

2. “extended” mode, where its new primitives are enabled. Full compatibility
with TeX is not actually sought and the primary concern is to make
typesetting easier through its new primitives.

3. “enhanced” mode, where bidirectional text is also supported. This mode is
taken to be a radical departure from standard TeX.

^3 http://www.ctan.org/tex-archive/systems/nts/

^4 We last accessed the above URL in March 2004.







Today, ε-TeX is part of all widely used TeX distributions and has proven to be
very stable. Indeed, in 2003 the LaTeX team requested that future distributions
use ε-TeX by default for LaTeX commands, which has since been implemented
in TeX Live and other distributions.

2.5 Ω

Ω [16], which was first released in 1996, is primarily the work of two people:
Yannis Haralambous and John Plaice. It extends TeX in order to support the
typesetting of multilingual documents, and provides new primitives and new
facilities for this purpose. Ω’s default character encoding is the Unicode UCS-2
encoding, while it can easily process files in almost any imaginable character
encoding. In addition to that, Ω supports the parameterization of paragraph and
page direction, thus allowing the typesetting of text in almost any imaginable
writing method^5.

Much of its power comes from its new notion of ΩTPs (Ω Translation
Processes). In general, an ΩTP is normally used to transform a document from a
particular character encoding to another. Obviously, an ΩTP can be used to
transform text from one character set to another. An ΩTP is actually a finite
state automaton and, thus, it can easily handle cases where the typesetting of
particular characters is context dependent. For example, in traditional Greek
typography, there are two forms of the small letter theta, which are supported
by Unicode [namely ϑ (03D1) and θ (03B8)]. The first form is used at the
beginning of a word, while the second in the middle of a word. The following code
borrowed from [16] implements exactly this feature:

input: 2; output: 2;
aliases:
LETTER = (@"03AC-@"03D1 | @"03D5 | @"03D6 |
          @"03F0-@"03F3 | @"1F00-@"1FFF);
expressions:
^({LETTER})@"03B8({LETTER} | @"0027)
      => \1 @"3D1 \3;
. => \1;

For performance reasons, ΩTPs are compiled into ΩCPs (Ω Compiled Processes).
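For readers unfamiliar with the translation-process notation, the theta rule can be paraphrased in Python roughly as follows. This is only a sketch of the rule's intent as described in the text: the function names are ours, the letter ranges are copied from the LETTER alias, and treating the start of the text as a non-letter context is our own assumption rather than documented behaviour of the real engine.

```python
# Code point ranges taken from the LETTER alias of the rule above.
LETTER_RANGES = [(0x03AC, 0x03D1), (0x03D5, 0x03D6),
                 (0x03F0, 0x03F3), (0x1F00, 0x1FFF)]

THETA = "\u03b8"         # the form used in the middle of a word
THETA_SYMBOL = "\u03d1"  # the form used at the beginning of a word


def is_letter(ch):
    """True if ch falls in one of the Greek letter ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LETTER_RANGES)


def fix_initial_theta(text):
    """Replace a word-initial theta by its alternative form."""
    out = []
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and is_letter(text[i - 1])
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # The rule requires a following letter or an apostrophe (U+0027).
        next_ok = nxt != "" and (is_letter(nxt) or nxt == "'")
        if ch == THETA and not prev_is_letter and next_ok:
            out.append(THETA_SYMBOL)
        else:
            out.append(ch)
    return "".join(out)


# Two Greek words: the first starts with theta, the second has it mid-word.
sample = "\u03b8\u03b5\u03cc\u03c2 \u03ac\u03bd\u03b8\u03bf\u03c2"
converted = fix_initial_theta(sample)
```

The decision for each character depends only on its immediate neighbours, which is why a finite state automaton is sufficient for rules of this kind.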

External ΩTPs are programs in any programming language that can handle
problems that cannot be handled by ordinary ΩTPs. For example, one can
prepare a Perl script that can insert spaces in a Thai language document.
Technically, external ΩTPs are programs that read from the standard input and
write to the standard output. Thus, Ω forks a new process to allow the use of
an external ΩTP. In [16] there are a number of examples (some of them were
borrowed from [7]).
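The standard-input/standard-output contract described above can be illustrated with a short Python sketch. It shows only the general mechanism of piping text through a child process; it is not taken from the engine's source, and the helper `run_external_filter` and the upper-casing stand-in filter are hypothetical.

```python
import subprocess
import sys


def run_external_filter(command, text):
    """Pipe text through an external program and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # a failing filter raises CalledProcessError
    )
    return result.stdout.decode("utf-8")


# Stand-in filter: a one-line Python program that upper-cases its input.
# A real filter could equally be a Perl or Ruby script.
UPPERCASE_FILTER = [
    sys.executable, "-c",
    "import sys; data = sys.stdin.buffer.read().decode('utf-8'); "
    "sys.stdout.buffer.write(data.upper().encode('utf-8'))",
]

filtered = run_external_filter(UPPERCASE_FILTER, "omega")
```

The price of this flexibility is one new process per filter invocation, which is the cost the text refers to when it mentions forking.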

We should note that the field of multilingual typesetting is an active research
field, which is the main reason why Ω is still an experimental system. We should
also note that ε-Ω [4], by Giuseppe Bilotta, is an extension of Ω that tries to
incorporate the best features of ε-TeX and Ω in a new typesetting engine.

^5 Currently the boustrophedon writing method is the only one not supported.
2.6 pdfTeX

pdfTeX [18] is yet another TeX extension that can directly produce a file in
Adobe’s PDF format. Recently, pdf-ε-TeX was introduced, merging the
capabilities of both pdfTeX and ε-TeX.

3 Towards a Universal Typesetting Engine 

From the discussion above, it is obvious that there is a trend to create new
typesetting engines that provide the best features of different existing
typesetting engines. Therefore, a Universal Typesetting Engine should
incorporate all the novelties that the various TeX-like derivatives have
presented so far. In addition, such a system should be designed by taking into
serious consideration all aspects of modern software development and
maintenance. However, our departure should not be too radical, in order to be
able to use the existing codebase. Let us now examine all these issues in turn.

3.1 Discussion of Features 

Data Structures. TeX’s inherent limitations are due to the fact that it was
developed in a time when computer resources were quite scarce. In addition,
TeX was developed using the now outdated structured programming methodology
of program development.

Nowadays, hardware imposes virtually no limits in design and development 
of software. Also, new programming paradigms (e.g., aspect-oriented program- 
ming [8], generative programming [2], etc.) and techniques (e.g., extreme pro- 
gramming [1]) have emerged, which have substantially changed the way software 
is designed and developed. 

These remarks suggest that a new typesetting engine should be free of
“artificial” limitations. Naturally, this is not enough as we have to leave
behind the outdated programming techniques and make use of modern techniques to
ensure the future of the Universal Typesetting Engine. Certainly, NTS was a step
in the right direction, but in the light of current developments in the area of
software engineering it is now a rather outdated piece of software.

New Primitive Commands. Modern document manipulation demands new
capabilities that could not have been foreseen at the time TeX was created. A
modern typesetting engine should provide a number of new primitive commands
to meet the new challenges imposed by modern document preparation. Although
the new primitives introduced by ε-TeX and Ω solve certain problems (e.g.,
bidirectional or, more generally, multidirectional typesetting), they are still
unable to tackle other issues, such as the inclusion of audio and/or animation.

Input Formats. For reasons of compatibility, the current input format must
be supported. At the same time the proliferation of XML and its applications
makes it more than mandatory to provide support for XML content. Currently,
XMLTeX is a TeX format that can be used to typeset validated XML files^6. In
addition, XLaTeX [6] is an effort to reconcile the TeX world with the XML world.
In particular, XLaTeX is an XML Document Type Definition (DTD) designed
to provide an XMLized syntax for LaTeX. However, we should learn from the
mistakes of the past and make the system quite adaptable. This means that as
new document formats emerge, the system should be easily reconfigurable to
“comprehend” these new formats.

Output Formats. The pdfLaTeX variant has become quite widespread, due to its
ability to directly produce output in a very popular document format (namely
Adobe’s Portable Document Format). Commercial versions of TeX are capable
of directly generating PostScript files without the need of any driver programs.
However, as in the case of the input formats, it is quite possible that new
document formats will appear. Thus, we need to make sure that these document
formats will find their way into TeX sooner or later.

In addition, XML initiatives such as MathML and SVG (Scalable Vector
Graphics) are increasingly common in electronic publishing of scientific
documents (i.e., quite demanding documents from a typographical point of view).
Thus, it is absolutely necessary to be able to choose the output format(s) from a
reasonable list of options. For example, when one makes a drawing using LaTeX’s
picture environment, it would be quite useful to have SVG output in addition
to the “standard” output. Currently, Ω can produce XML content, but it cannot
generate PDF files.

Innovative Ideas. The assorted typesetting engines that follow TeX’s spirit are
not mere extensions of TeX. They have introduced a number of useful features
and/or capabilities. For example, Ω’s ΩTPs and its ability to handle Unicode
input by default should certainly make their way into a new typesetting
engine. In addition, ε-TeX’s new conditional primitives are quite useful in
macro programming.

Typesetting Algorithms. The paragraph breaking and hyphenation algorithms
in TeX make the difference when it comes to typographic quality. Robust and
adaptable as they are, these algorithms may still not produce satisfactory results
for all possible cases. Thus, it is obvious that we need a mechanism that will
adapt the algorithms so they can successfully handle such difficult cases.
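One way to picture such a mechanism is to treat the line breaker as a replaceable strategy. The Python sketch below is purely illustrative: the class `Paragraph` and both strategies are hypothetical names of ours, and the default strategy is a naive first-fit breaker used as a stand-in, not TeX's optimizing paragraph-breaking algorithm.

```python
def greedy_breaker(words, width):
    """First-fit line breaking: a deliberately simple stand-in algorithm."""
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) > width:
            lines.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(" ".join(current))
    return lines


def one_word_per_line(words, width):
    """Alternative strategy for a 'difficult case', e.g. a very narrow column."""
    return list(words)


class Paragraph:
    """Holds the text and delegates line breaking to a replaceable strategy."""

    def __init__(self, text, breaker=greedy_breaker):
        self.words = text.split()
        self.breaker = breaker

    def typeset(self, width):
        return self.breaker(self.words, width)


para = Paragraph("TeX breaks paragraphs into lines")
default_lines = para.typeset(width=14)
para.breaker = one_word_per_line  # swap the algorithm without touching Paragraph
narrow_lines = para.typeset(width=14)
```

The point of the design is that a document or package could install a different breaker for a problematic paragraph while the rest of the engine stays unchanged.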

Fonts. Typesetting means to put type (i.e., font glyphs) on paper. Currently,
only METAFONT fonts and PostScript Type 1 fonts can be used with all different
TeX derivatives. Although Ω is Unicode aware, still it cannot handle TrueType
fonts to a satisfactory degree (one has to resort to programs like ttf2tfm in
order to make use of these fonts). In addition, for new font formats such as
OpenType and SVG fonts there is only experimental support, or none at all. A
new typesetting engine should provide font support in the form of plug-ins so
that support for new font formats could be easily provided.

^6 Validation should be handled by an external utility. After all, there are a
number of excellent tools that can accomplish this task and thus it is too
demanding to ask for the incorporation of this feature in a typesetting engine.
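A plug-in scheme for fonts could look roughly like the following Python sketch, in which loaders register themselves for file extensions and the engine dispatches on the extension. All names here (`font_loader`, `load_font`, the loader functions) are hypothetical, and the loaders are stubs that return a small description instead of parsing a real font file.

```python
FONT_LOADERS = {}


def font_loader(*extensions):
    """Decorator that registers a loader for the given file extensions."""
    def register(func):
        for ext in extensions:
            FONT_LOADERS[ext.lower()] = func
        return func
    return register


@font_loader(".pfb", ".pfa")
def load_type1(path):
    # Stub: a real plug-in would parse the Type 1 font program.
    return {"format": "Type 1", "path": path}


@font_loader(".ttf")
def load_truetype(path):
    # Stub: a real plug-in would read the TrueType tables.
    return {"format": "TrueType", "path": path}


def load_font(path):
    """Dispatch on the file extension; unknown formats fail loudly."""
    dot = path.rfind(".")
    ext = path[dot:].lower() if dot != -1 else ""
    try:
        loader = FONT_LOADERS[ext]
    except KeyError:
        raise ValueError(f"no font plug-in registered for {ext!r}") from None
    return loader(path)


# Supporting a new format later needs only one more registration.
@font_loader(".otf")
def load_opentype(path):
    return {"format": "OpenType", "path": path}
```

With this arrangement the core never needs to know the list of supported formats in advance, which is the property the paragraph above asks for.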

Scripting. Scripting is widely accepted as a means of producing a larger
software product from smaller components by “gluing” them together. It plays a
significant role in producing flexible and open systems. Its realization is made
through the so-called “scripting languages”, which usually are different from the
language used to implement the individual software components.

One could advance the idea that scripting in TeX is possible by using
the language itself. This is true to some extent, since TeX works in a form of
“interpretive mode” where expressions can be created and evaluated dynamically
at runtime - a feature providing the desired flexibility of scripting languages.
But TeX itself is a closed system, in that almost everything needs to be
programmed within TeX itself. This clearly does not lead to the desired openness.

A next generation typesetting engine should be made of components that can
be “glued” together using any popular scripting language. To be able to program
in one’s language of choice is a highly desirable feature. In fact, we believe it
is the only way to attract as many contributors as possible.

Development Method. Those software engineering techniques which have proven 
successful in the development of real-world applications should form the core 
of the program methodology which will be eventually used for the design and 
implementation of a next generation typesetting engine. Obviously, generic pro- 
gramming and extreme programming as well as aspect-oriented programming 
should be closely examined in order to devise a suitable development method. 

All the features mentioned above as well as the desired ones are summarized 
in Table 1. 



Table 1. Summary of features of TeX and its extensions.
3.2 Architectural Abstractions

Roughly speaking, the Universal Typesetting Engine we are proposing in this
paper is a project to design and, later, to implement a new system that will
support all the “good features” incorporated in various TeX derivatives plus
some novel ideas, which have not found their way into any existing TeX
derivative.

Obviously, it is not enough to just propose the general features the new 
system should have - we need to lay down the concrete design principles that 
will govern the development of the system. A reasonable way to accomplish 
this task is to identify the various concepts that are involved. These concepts 
will make up the upper abstraction layer. By following a top-down analysis,
we will eventually be in a position to have a complete picture of what is needed
in order to proceed with the design of the system.

The next step in the design process is to choose a particular system
architecture. TeX and its derivatives are definitely monolithic systems. Other
commonly used system architectures include the microkernel and exokernel
architectures, both well-known from operating system research.

Microkernel Architecture. A microkernel-based design has a number of
advantages. First, it is potentially more reliable than a conventional monolithic
architecture, as it allows for moving the major part of system functionality to
other components, which make use of the microkernel. Second, a microkernel
implements a flexible set of primitives, providing a high level of abstraction,
while imposing little or no limitations on system architecture. Therefore,
building a system on top of an existing microkernel is significantly easier
than developing it from scratch.

Exokernel Architecture. Exokernels follow a radically different approach. As 
with microkernels, they take as much out of the kernel as possible, but rather 
than placing that code into external programs (mostly user-space servers) as 
microkernels do, they place it into shared libraries that can be directly linked 
into application code. Exokernels are extremely small, since they arbitrarily 
limit their functionality to the protection and multiplexing of resources. 

Both approaches have their pros and cons. We believe that a mixed approach 
is the best solution. For example, we can have libraries capable of handling 
the various font formats (e.g., Type 1, TrueType, OpenType, etc.) that will be
utilized by external programs that implement various aspects of the typesetting 
process (e.g., generation of PostScript or PDF files). Let us now elaborate on the 
architecture we are proposing. The underlying components are given in Figure 1. 

The Typesetting Kernel (TK) is one of the two core components at the first 
layer. It can be viewed as a “stripped-down” version of TeX, meaning that its
role as a piece of software is the orchestration of several typesetting activities. 
A number of basic algorithms are included in this kernel both as abstract no- 
tions - necessary for a general-purpose typesetting engine - and concrete imple- 
mentations. So, TK incorporates the notions of paragraph and page breaking, 
mathematical typesetting and is Unicode-aware. It must be emphasized that 
TK “knows” the concept of paragraph breaking and the role it plays in typesetting,
but it is not bound to a specific paragraph breaking algorithm. The same
principle applies to all needed algorithms.

[Figure 1: layered diagram. Legible labels: L1 (TK, ASK); “latex && bibtex && latex”;
“LaTeX”; “Type 3, dvips + Type 1”; “TeX, e-TeX, Ω”.]

Terms

TK   Typesetting Kernel
ASK  Active Scripting Kernel
TAs  Typesetting Algorithms
DMs  Document Models
SEs  Scripting Engines
HyP  Hyphenation Patterns
WFs  Workflows

Fig. 1. The proposed microkernel-based layered architecture. The arrows show rough
correspondence between the several architectural abstractions and their counterparts
in existing monolithic typesetting engines.

The Active Scripting Kernel (ASK) is the second of the core components and 
the one that allows scripting at various levels, using a programming (scripting) 
language of one’s choice. It is in essence a standardized way of communicating 
between several languages (TeX, Perl, Python), achieved by providing a
consistent Application Programming Interface (API). The most interesting property
of ASK is its activeness. This simply means that any extension programmed 
in some language is visible to any other available languages, as long as they 
adhere to the standard Active Scripting Kernel API. For example, an external 
module/service written in Perl that provides a new page breaking algorithm is 
not only visible but also available for immediate use from Python, C, etc. 
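
The “activeness” property can be pictured as a shared registry: an extension
registered through one engine becomes callable by name from every other engine.
The Python sketch below is only an illustration under our own assumptions. The
names ActiveScriptingKernel, register, call and origin are hypothetical and are
not part of any specified ASK API, and the toy page breaker merely stands in
for a real algorithm.

```python
# Hypothetical sketch of an "active" registry; not the actual ASK API.
from typing import Callable, Dict, List


class ActiveScriptingKernel:
    """Keeps one shared table of extensions, whatever engine supplied them."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable] = {}
        self._origin: Dict[str, str] = {}

    def register(self, name: str, engine: str, func: Callable) -> None:
        # 'engine' only records which scripting engine supplied the extension.
        self._services[name] = func
        self._origin[name] = engine

    def call(self, name: str, *args):
        # Any engine may call any registered extension by name.
        return self._services[name](*args)

    def origin(self, name: str) -> str:
        return self._origin[name]


def greedy_page_breaker(line_heights: List[int], page_height: int) -> List[List[int]]:
    """A toy page breaker: fill each page until the next line would overflow."""
    pages, current, used = [], [], 0
    for h in line_heights:
        if current and used + h > page_height:
            pages.append(current)
            current, used = [], 0
        current.append(h)
        used += h
    if current:
        pages.append(current)
    return pages


ask = ActiveScriptingKernel()
# Pretend this extension was supplied through the Perl engine...
ask.register("page_break", "perl", greedy_page_breaker)
# ...and is now used from another engine through the same kernel.
pages = ask.call("page_break", [4, 4, 4, 4, 4], 10)
```

A single table shared by all engines is the simplest way to obtain this
visibility; a real kernel would also have to convert values between the
languages involved, which the sketch leaves out.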

Above TK and ASK, at the second layer, we find a collection of typesetting 
abstractions. 

Fonts are at the heart of any typesetting engine. It is evident that font archi- 
tectures change with the passing of time, and the only way to allow for flexibility 
in this part is to be open. Although there are many different font formats, all are
used to define glyphs and their properties. So instead of directly supporting all
possible font formats, we propose the use of an abstract font format (much like
all font editors have their own internal font format). With the use of external
libraries that provide access to popular font formats (e.g., a FreeType library,
a Type 1 font library, etc.), it should be straightforward to support any existing 
or future font format. 

The various Typesetting Algorithms (TAs) - algorithms that implement a 
particular typographic feature - should be coded using the Active Scripting 
Kernel API. In a system providing the high degree of flexibility we are proposing, 
it will be possible to exhibit, in the same document, the result of applying several 
paragraph and page breaking algorithms. By simply changing a few runtime 
parameters it will be possible to produce different typographic “flavors” of the 
same document. 
A Scripting Engine (SE) is the realization of the ASK APIs for a particular
scripting language. For reasons of uniformity, the TeX programming language
will be provided as a Scripting Engine, along with engines for Perl, Ruby and
Python. This will make all the existing TeX codebase available for immediate use
and it will provide for cooperation between existing LaTeX packages and future
enhancements in other languages. Thus, a level of 100% TeX compatibility will
be achieved, merely as a “side-effect” of the provided flexibility.

The idea of a Document Model (DM) concerns two specific points: the external
representation of the document, as it is “edited” for example in an editor, or
“saved” on a hard disk, and its internal representation, used by the typesetting
engine itself. It is clear that under this distinction, current LaTeX documents
follow the (fictional) “LaTeX Document Model”, XLaTeX documents follow the
“XLaTeX Document Model” and an XML document with its corresponding DTD
follows an analogous “XML+DTD Document Model”.

We strongly believe that how a document is written should be separated from 
its processing. For the last part, an internal representation like the Abstract Syn- 
tax Trees (ASTs) used in compiler technology is highly beneficial. One way to 
think of DM is as the typographic equivalent of the Document Object Model 
(DOM). That is, it will be a platform-neutral and language-neutral represen- 
tation allowing scripts to dynamically access and update the content, structure 
and style of documents. 

Several Document Processors (DPs) may be applied to a specific document
before actual typesetting takes place. DPs are the analog of ΩTPs. By
leveraging the scripting power of ASK, the representation expressiveness of DPs
is increased - as opposed to algorithmic expressiveness (Turing-completeness),
which is evident, e.g., in Ω, but is not the sole issue.

The Workflows (WFs) and Tools are at the highest architectural layer.
Currently, there are a number of tools that may not produce a final typeset result,
but are important for the proper preparation of a document. For example, such 
tools include bibliography, index and glossary generation tools. In the proposed 
architecture, all these programs will take advantage of other architectural ab- 
stractions - such as the Document Model or the Scripting Engines - in order to 
be more closely integrated in the typesetting engine as a whole. 

Of particular importance is the introduction of the Workflows notion. A 
workflow is closely related to the operation or, to be more precise, cooperation 
of several tools and the typesetting engine in the course of producing a type- 
set document. In effect, a workflow specifies the series of execution (probably 
conditional) steps and the respective inputs/outputs during the “preparation” 
of a document. By introducing a workflow specification for each tool, we relieve 
the user from manually specifying all the necessary actions in order to get a 
“final” .pdf (or whatever output format has been requested). Instead, the user 
will declaratively specify that the services of a tool are needed and the engine 
will load the respective workflows, compose them and execute them. 
We shall give a workflow example concerning a BibTeX-like tool. What we do
here is to transform our experience of using bibtex into declarations specifying 
its behaviour in cooperation with latex: 

WORKFLOW DEFINITION bibtex 

SERVICE bibtex NEEDS latex 
SERVICE bibtex INTRODUCES latex 

In effect, this translates a hypothetical Makefile: 

all:
	latex mydoc
	bibtex mydoc
	latex mydoc

for the preparation of the fictitious mydoc.tex document into a declarative
specification that is given only once as part of the bibtex tool!
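
To make the correspondence concrete, the following Python sketch expands the two
declarations into the same three steps as the Makefile. It rests on an assumed
reading that the text does not spell out formally: NEEDS is taken to mean “run
this tool before the service” and INTRODUCES to mean “run this tool after the
service”. The function names parse_workflow and plan are ours, and the sketch
does not compose the workflows of several tools.

```python
# Hypothetical reading of the NEEDS / INTRODUCES declarations; the text does
# not define their semantics formally.
from typing import Dict, List


def parse_workflow(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Collect 'SERVICE a NEEDS b' and 'SERVICE a INTRODUCES b' lines."""
    rules: Dict[str, Dict[str, List[str]]] = {}
    for line in text.splitlines():
        words = line.split()
        if len(words) == 4 and words[0] == "SERVICE":
            _, service, relation, other = words
            if relation not in ("NEEDS", "INTRODUCES"):
                continue
            entry = rules.setdefault(service, {"NEEDS": [], "INTRODUCES": []})
            entry[relation].append(other)
    return rules


def plan(service: str, rules: Dict[str, Dict[str, List[str]]], document: str) -> List[str]:
    """Assumed semantics: NEEDS runs before the service, INTRODUCES after it."""
    entry = rules.get(service, {"NEEDS": [], "INTRODUCES": []})
    steps = [f"{tool} {document}" for tool in entry["NEEDS"]]
    steps.append(f"{service} {document}")
    steps += [f"{tool} {document}" for tool in entry["INTRODUCES"]]
    return steps


SPEC = """
WORKFLOW DEFINITION bibtex
SERVICE bibtex NEEDS latex
SERVICE bibtex INTRODUCES latex
"""

steps = plan("bibtex", parse_workflow(SPEC), "mydoc")
```

Under these assumptions, steps holds the commands latex mydoc, bibtex mydoc and
latex mydoc, in that order.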

3.3 On Design and Evolution 

Recent advances in software engineering advocate the use of multidimensional 
separation of concerns as a guiding design principle. Different concerns should be 
handled at different parts of code and ideally should be separated. For example, 
the representation of a document and its processing are two separate concerns 
and should be treated as such. Their interaction is better specified out of their 
individual specifications. Thus, we have introduced the Document Models
notion to cope with the existing TeX/LaTeX base as well as any future document
representation.

Several architectural abstractions of Figure 1 are candidates to be specified 
as “services” at different granularities. For example, any Tool of the third layer 
can be thought of as a service that is registered with a naming authority and 
discovered dynamically, for immediate use on demand. A TrueType Font Service, 
regarding the second layer Font abstraction, is another example, this time more 
of a fine-grained nature, in the sense that a Tool (coarse-grained service) utilizes 
a Font (fine-grained service). 

The proposed architecture makes special provisions for evolution by keeping 
rigid design decisions to a minimum. Built-in Unicode awareness is such a notable 
rigid design decision, but we feel that its incorporation is mandatory. Besides 
that, the ideas of pluggable algorithms and scripting are ubiquitous and help
maintain the desired high degree of flexibility. 

At the programming level, any style of design and development that promotes 
evolution can be applied. In the previous section we have actually demonstrated 
that the proposed architecture can even handle unanticipated evolution at the 
workflow level: the bibtex tool workflow specification causes the execution of 
an existing tool (latex) but we have neither altered any workflow for latex nor 
does latex need to know that “something new” is using it. In effect, we have 
introduced (the use of the keyword INTRODUCES was deliberate) a new aspect [8].
4 Conclusions and Future Work

In this paper we have reviewed the most widespread modern approaches to
extending TeX, THE typesetting engine. After analyzing weaknesses of the
approaches and the existing support for several features, we have presented our
views on the architecture of an open and flexible typesetting engine.

We have laid down the basic architectural abstractions and discussed their 
need and purpose. Of course, the work is still at the beginning stages and we 
are now working on refining the ideas and evaluating design and implementation 
approaches. 

The introduction of the Active Scripting Kernel is of prime importance and 
there is ongoing work to completely specify a) the form of a standard procedural 
API and b) support for other programming styles, including object-oriented and 
functional programming. This way, an object may for example take advantage 
of an algorithm that is better described in a functional form. There are
parallel plans for transforming TeX into a Scripting Engine and at the same time
providing Engines powered by Perl and Python.

We are also investigating the application of the workflow approach at several 
parts in the architecture other than the interaction among tools. This, in turn, 
may raise the need for the incorporation of a Workflow Kernel at the core layer, 
along with the Typesetting Kernel and the Active Scripting Kernel. 
Abstract. The code for the Ω Typesetting System has been substantially
reorganised. All fixed-size arrays implemented in Pascal Web have
been replaced with interfaces to extensible C++ classes. The code for
interaction with fonts and Ω Translation Processes (ΩTP’s) has been
completely rewritten and placed in C++ libraries, whose methods are
called by the (now) context-dependent typesetting engine. The Pascal
Web part of Ω no longer uses change files. The overall architecture is
now much cleaner than that of previous versions.

Using C++ has allowed the development of object-oriented interfaces
without sacrificing efficiency. By subclassing or wrapping existing stream
classes, character set conversion and ΩTP filter application have been
simultaneously generalised and simplified. Subclassing techniques are
currently being used for handling fonts encoded in different formats, with a
specific focus on OpenType.



1 Introduction 

In this article, we present the interim solution for the stabilisation of the existing
Ω code base, with a view towards preparing for the design and implementation
of a new system. We focus on the overall structure of the code as well as on
specific issues pertaining to characters, fonts, ΩTP’s and hyphenation.

Since the first paper on Ω was presented at the 1993 Aston TUG Conference,
numerous experiments with Ω have been undertaken in the realm of multilingual
typesetting and document processing. This overall work has given important
insights into what a future document processing system, including high quality
typesetting, should look like. We refer the reader to the 2003 TUG
presentation [7], as well as to the position papers presented to the Kyoto Glyph and
Typesetting Workshop [3,6,8]. Clearly, building an extensive new system will
require substantial effort and time, both at the design and the implementation
levels, and so it is a worthwhile task to build a production version of Ω that will
be used while further research is undertaken.
The standard web2c infrastructure, which assumes that a binary is created
from a single Pascal Web file and a single Pascal Web change file, is simply not
well suited for the development of large scale software, of any genre. For this
reason, we have eliminated the change files, and broken up the Pascal Web file
into chapter-sized files. All fixed-size arrays have been reimplemented in C++
using the Standard Template Library. Characters are now 32 bits, using the
wchar_t data type, and character set conversion is done automatically using the
routines available in the iconv library. The entire Pascal Web code for fonts and
ΩTP’s, including that of Donald Knuth, has been completely rewritten in C++
and placed in libraries. Clean interfaces have been devised for the use of this
code from the remaining Pascal code.

2 Problems with Pascal Web 

When we examine the difficulties in creating Ω as a derivation of tex.web, we
should understand that there is no single source for these difficulties.

Pascal was designed so that a single-pass compiler could transform a mono- 
lithic program into a running executable. Therefore, all data types must be 
declared before global variables; in turn, all variables must be declared before 
subroutines, and the main body of code must follow all declarations. This choice 
sacrificed ease of programming for ease of compiler development; the resulting 
constraints can be felt by anyone who has tried to maintain the TeX engine.

Pascal Web attempts to alleviate this draconian language vision by allowing
the arbitrary use within code blocks - called modules - of pointers to other
modules, with a call-by-name semantics. The result is a programming
environment in which the arbitrary use of GOTOs throughout the code is encouraged,
more than ten years after Dijkstra’s famous paper. Knuth had responded
correctly to Dijkstra’s paper, stating that the reasonable use of GOTOs simplifies
code. However, the arbitrary use of GOTOs across a program, implicit in the
Pascal Web methodology, restricts code scalability. Knuth himself once stated
that one of the reasons for stopping work on TeX was his fear of breaking it.

For a skilled, attentive programmer such as Knuth, developing a piece of 
code that is not going to evolve, it is possible to write working code in Pascal 
Web, up to a certain level of complexity. However, for a program that is to 
evolve significantly, this approach is simply not tenable, because the monolithic 
Pascal vision is inherited in Pascal Web’s change file mechanism. Modifications 
to TeX are supposed to be undertaken solely using change files; the problem
with this approach is that the vision of the code maintainer is that they are 
modifying functions, procedures, and so on. However, the real structure of a 
Pascal Web program is the interaction between the Pascal Web modules, not 
the functions and procedures that they define. Hence maintaining a Pascal Web 
program is a very slow process. Back in 1993, when the first Ω work was being
undertaken, “slow” did not just mean slow in design and programming, but also 
in compilation: the slightest modification required a 48-minute recompilation. 
The size limitations created by tex.web’s compile-time fixed-size arrays are
obvious and well known. This issue was addressed publicly by Ken Thompson
in the early 1980s, and both the existing Ω and the web2c distribution have
substantially increased the sizes. However, these arrays raise other problems. The
eqtb, str_pool, font_info and mem arrays all have documented programming
interfaces. However, whenever these interfaces are insufficient, the code
simply makes direct accesses into the arrays. Hence any attempt to significantly
modify these basic data structures requires the modification of the entire TeX
engine, and not simply the implementations of the structural interfaces.

In addition, the single input buffer for all active files of tex.web turns out
to be truly problematic for implementing ΩTP’s. Since an ΩTP can read in
an arbitrary amount of text before processing it, a new input buffer had to be 
introduced to do this collection. The resulting code is anything but elegant, and 
could certainly be made more efficient. 

Finally, problems arise from the web2c implementation of Pascal Web. Many 
of the routines written in C to support the web2c infrastructure make the implicit 
assumption that all characters are 8 bits, making it difficult to generalise to 
Unicode (currently 21 bits), even though C itself has a datatype called wchar_t. 

3 Suitability of C++ 

The advantages of the use of C++ as an implementation language for
stream-oriented typesetting, over the Pascal Web architecture, are manifold. The chief
reason for this is that the rich set of tools and methodologies that have evolved
in the twenty-five years since the introduction of TeX includes developments not
only in programming languages and environments, but in operating systems, 
file structure, multiprocessing, and in the introduction of whole new paradigms, 
including object-oriented software and generic programming. 

C++ is the de facto standard for object-oriented systems development, with 
its capability to provide low-level C-style access to data structures and system 
resources (and, in the case of Unix-like systems, direct access to the kernel system 
call API), for the sake of efficiency.

In addition, the C++ Standard Template Library (STL) offers built-in
support for arbitrary generic data structures and algorithms, including extensible,
random-access arrays. It would be foolish to ignore such power when it is so 
readily available. 

Since C++ is fully compatible with C, one can still take advantage of many
existing libraries associated with TeX, such as Karl Berry’s kpathsea file
searching library, and the iconv library for character-set conversion between
Unicode and any other imaginably-used character set.

The abilities to use well-known design patterns for generic algorithm support 
(plug-in paragraphers, generic stream manipulation), as well as generic repre- 
sentation of typesetting data itself, add a wealth of possibilities to future, open 
typesetting implementations. 
4 Organisation of the Ω Code Base

Obviously, we are moving on. Our objective is to include the existing Ω
functionality, to stretch it where appropriate, leaving clean interfaces so that, if others
wish to modify the code base, they can do so. Our current objective is not to
rewrite TeX, but its underlying infrastructure.



4.1 Reorganising the Pascal Web Code 

The tex.web file has been split into 55 files called 01.web to 55.web. The tex.ch
file has been converted into 55 files, 01.ch to 55.ch. Data structure by data
structure - specifically the large fixed-size arrays - we have combed the code,
throwing out the definitions of the data structures and replacing their uses with
Pascal procedure calls which, once passed through the web2c processor, become
C++ method calls. In the process, most of the code in the change files ends up
either being unnecessary, or directly integrated in the corresponding .web files.

4.2 The External Interface with Ω

We envisage that Ω will be used in a number of different situations, and not
simply as a batch standalone program. To facilitate this migration, we have
encapsulated the interface to the external world into a single class. This interface
handles the interpretation of the command line, as well as the setup for the file
searching routines, such as are available in the kpathsea library. Changing this
class will allow the development of an Ω typesetting server, which could be used
by many different desktop applications.

4.3 Characters, Strings and Files 

The other interface to the outside world is through the data passed to Ω itself.
This data is in the form of text files, whose characters are encoded in a multitude 
of different character encodings. 

For characters, TeX has two types, ASCII_code and text_char, the
respective internal and external representations of 8-bit characters. The new Ω uses the
standard C/C++ datatype, wchar_t. On most implementations, including GNU
C++, wchar_t is a 32-bit signed integer, where the values 0x0 to 0x7fffffff
are used to encode characters, and the value 0xffffffff (-1) is used to encode
EOF. Pascal Web strings are converted by the tangle program into str_number,
where values 0 to 255 are reserved for the 256 8-bit characters. We have
modified tangle so that the strings are numbered -256 downwards, rather than 256
upwards. Hence, str_number and wchar_t are both 32-bit signed integers.
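
The value ranges just described can be summarised in a few lines of Python. This
is a sketch for illustration only, not code from the implementation. The function
names are ours, and the rule that the string with 0-based index k receives the
number -256 - k is an assumption drawn from “numbered -256 downwards”.

```python
# Sketch of the value ranges described above; the function names are ours.
EOF = -1                 # 0xffffffff read as a 32-bit signed integer
MAX_CHAR = 0x7FFFFFFF    # largest value used to encode a character
FIRST_STRING = -256      # strings are numbered from -256 downwards


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def classify(value: int) -> str:
    """Say what a 32-bit signed value stands for under the scheme above."""
    if 0 <= value <= MAX_CHAR:
        return "character"
    if value == EOF:
        return "EOF"
    if value <= FIRST_STRING:
        return "string"
    return "unused"      # -255 .. -2 are not assigned by the text


def string_number(index: int) -> int:
    """Assumed numbering: the index-th string (0-based) gets -256 - index."""
    return FIRST_STRING - index
```

Because characters occupy the non-negative half of the range and strings the
negative half, one signed 32-bit value can carry either kind without ambiguity.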

When dealing with files, there are two separate issues, the file names, and the 
file content. Internally, all characters are 4-byte integers, but on most systems, file
names are stored using 8-bit encodings, specified according to the user’s locale. 
Hence, character-set conversion is now built into the file-opening mechanisms, 
be they for reading or writing. 
The actual content of the files may come from anywhere in the world and
a single file system may include files encoded with many different encoding
schemes. We provide the means for opening a file with a specified encoding,
as well as opening a file with automatic character encoding detection, using a
one-line header at the beginning of the file. The actual character set conversion
is done using the iconv library. As a result of these choices, the vast majority
of the Ω code can simply assume that characters are 4-byte Unicode characters.
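
The header-based detection can be illustrated as follows. The syntax of the
one-line header is not given in the text, so the marker “%& encoding=” used here
is purely hypothetical, as is the function name decode_file_bytes. Python’s
built-in codecs stand in for the iconv library, and the header line itself is
assumed to be plain ASCII.

```python
# Sketch only: the header syntax below is an assumption, and Python's built-in
# codecs stand in for the iconv library named in the text.
HEADER_PREFIX = b"%& encoding="   # hypothetical one-line header marker


def decode_file_bytes(data: bytes, default: str = "utf-8") -> str:
    """Decode file content, honouring a one-line encoding header if present."""
    first_line, _, rest = data.partition(b"\n")
    if first_line.startswith(HEADER_PREFIX):
        # The header itself is assumed to be plain ASCII.
        encoding = first_line[len(HEADER_PREFIX):].strip().decode("ascii")
        return rest.decode(encoding)
    # No header: fall back to the encoding requested by the caller.
    return data.decode(default)


greek = "%& encoding=iso-8859-7\n".encode("ascii") + "αβγ".encode("iso-8859-7")
plain = "abc\n".encode("utf-8")
```

Once decoded, the text is a sequence of Unicode code points, so later stages need
no knowledge of the encoding in which the file was stored.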

In addition to the data files, the following information must be passed through 
a character encoding converter: command line input, file names, terminal input, 
terminal output, log file output, generated intermediate files, and \special
output to the .dvi file.



4.4 The Fixed-Size Arrays 

The core of the new Ω implementation is the replacement of the large
fixed-size arrays, which are quickly summarized in the table below:

str_pool          string pool
buffer            input buffer
eqtb, etc.        table of equivalents
font_info, etc.   font tables
mem               dynamically allocated nodes
trie, etc.        hyphenation tables

For the cumulative data arrays, such as the string pool, we have created a new
class, Collection, a subclass of vector, that can be dump’ed to and undump’ed
from the format file.

Currently no work has been done with the dynamically allocated nodes and 
the hyphenation tables. Replacing the mem array with any significantly different 
structure for the nodes would effectively mean rewriting all of TeX, which is not
our current goal. 

4.5 The String Pool 

The TeX implementation used two arrays: str_pool contained all of the strings,
concatenated, while str_start held indices into str_pool indicating the
beginning of each string. This has all been replaced with a Collection<wstring*>,
where wstring is the STL string for 4-byte characters. As a result, we can
directly take advantage of the hashing facilities provided in the STL. Note that the
omega.pool file generated by tangle has been transformed into a C++ file.
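
The difference between the two layouts is easy to see in a small example. The
Python sketch below is an illustration, not the actual code. The sample strings
are invented, the helper names build_pool and pool_lookup are ours, and a plain
Python list stands in for the Collection class.

```python
# Illustration of the two layouts; the sample strings are invented.
from typing import List, Tuple


def build_pool(strings: List[str]) -> Tuple[str, List[int]]:
    """Old layout: one concatenated pool plus the start index of each string."""
    str_pool, str_start = "", [0]
    for s in strings:
        str_pool += s
        str_start.append(len(str_pool))   # also marks the end of s
    return str_pool, str_start


def pool_lookup(str_pool: str, str_start: List[int], n: int) -> str:
    """String n occupies str_pool[str_start[n] : str_start[n + 1]]."""
    return str_pool[str_start[n]:str_start[n + 1]]


samples = ["par", "hbox", "vskip"]
str_pool, str_start = build_pool(samples)

# New layout: a growable collection holding each string as its own object.
collection = list(samples)
```

In the old layout every access goes through index arithmetic on two arrays. In
the new one each string is an ordinary object that can be hashed, compared or
extended directly.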

4.6 The Input Buffer 

The TeX implementation used a single array buffer, holding all the active lines,
concatenated. This has now been broken up into a Collection of string streams.
This setup simplifies the programming of ΩTPs, which must add to the input
buffer while a line is being read.
4.7 The Table of Equivalents

The table of equivalents holds the values for the registers, the definitions for
macros, and the values for other forms of globally accessible data. The TeX
implementation used three arrays: eqtb held all of the potential equivalent
entries, hash mapped string numbers to equivalent entries, and hash_used was an
auxiliary Boolean table supporting the hashing.

The table has now been broken into several tables map<unsigned,Entry*> 
(for characters or register numbers) or map<wstring,Entry*> (for macro defini- 
tions), where Entry is some kind of value. Support is provided for characters up 
to 0x7fffffff, and the STL hashing capabilities are used. This infrastructure 
has been built using the intense library [9], thereby allowing each Entry to be 
versioned, allowing different definitions of a macro for different contexts. 



4.8 Fonts and ΩTPs

In terms of numbers of lines written, most of the new code in Ω is for handling
fonts and ΩTPs. However, because we are using standard OO technology, it is
also the most straightforward.

The original TeX and Ω code for fonts was concerned mostly with bit
packing of fields in the .tfm and .ofm files, and unpacking this information inside
the typesetting engine whenever necessary. This approach was appropriate when
space was at a premium, but it created very convoluted code. By completely
separating the font representations in memory and on disk, we have been able to
provide a very simple OO interface in the character-level typesetter of the Ω
engine, greatly simplifying the code for ligatures and kerning inside the typesetter,
as well as for the font conversion utilities.

Similarly, for the ΩTPs, filters can be implemented as function objects over
streams using iterators, tremendously simplifying the code base.
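
The idea of filters as function objects over streams can be sketched with Python
generators, which play the role that iterators over C++ streams play in the
text. Both filters below are invented examples and not real ΩTPs, and the names
strip_soft_hyphens, en_dash and compose are ours. The second filter holds back
one character before deciding what to emit, which is the kind of read-ahead that
a single shared input buffer made awkward.

```python
# Sketch only: these two filters are invented examples, not real ΩTPs.
from typing import Callable, Iterable, Iterator

Filter = Callable[[Iterable[str]], Iterator[str]]


def strip_soft_hyphens(chars: Iterable[str]) -> Iterator[str]:
    """Drop every soft hyphen (U+00AD) from the stream."""
    for c in chars:
        if c != "\u00ad":
            yield c


def en_dash(chars: Iterable[str]) -> Iterator[str]:
    """Turn each pair of hyphens into U+2013, holding one hyphen back."""
    held = False          # True when a single '-' is waiting for its partner
    for c in chars:
        if c == "-":
            if held:
                yield "\u2013"
                held = False
            else:
                held = True
        else:
            if held:
                yield "-"
                held = False
            yield c
    if held:
        yield "-"


def compose(*filters: Filter) -> Filter:
    """Chain filters so that each one consumes the output of the previous one."""
    def pipeline(chars: Iterable[str]) -> Iterator[str]:
        for f in filters:
            chars = f(chars)
        return iter(chars)
    return pipeline


result = "".join(compose(strip_soft_hyphens, en_dash)("co\u00adop--ed"))
```

Each filter sees only a stream of characters and yields another, so filters can
be chained in any order without knowing about one another.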

5 Supporting OpenType 

Since we are using a programming language supporting type hierarchies, it is 
possible to support many different kinds of font formats. In this section, we con- 
sider different options for supporting OpenType, the current de facto standard. 

The OpenType font format has been officially available since 1997. Unlike its 
predecessors, TrueType and PostScript Type 1 and 2, it facilitates handling of 
LGC (Latin-Greek-Cyrillic) scripts and also provides essential features for proper 
typesetting of non-LGC ones. Competing formats with similar capabilities (Apple
GX/AAT and Graphite) do exist, but the marketing forces are not as strong.

At the EuroTeX conference in the summer of 2003, we presented our first
steps towards an OpenType-enabled Ω system. At the time, OpenType and Ω
were just flirting, but since last year their relationship has become more and
more serious. In other words, what began simply as the adaptation of Ω to
OpenType fonts has now become a larger-scale project: the authors are planning
to restructure Ω’s font system and make OpenType a base font format. As will
be shown, full OpenType compatibility requires serious changes inside both Ω
and odvips. The other goal of the project is to simplify the whole font interface,
eliminating the need for separate metric files, virtual fonts and the like (while
the old system will of course continue to be supported).

Such a project, however, will certainly need some time to finish. Fortunately, 
the work done until now already provides users with the possibility to typeset 
using OpenType fonts, even if only a limited number of features are supported. 
It will be shown below that further development is not possible without major 
restructuring of the Ω system. Nevertheless, the present intermediate solution is
in fact one of the three that we will retain. 

Before getting to the discussion of possible solutions, let us briefly present the 
most important aspects of OpenType and their implications for Ω development.

5.1 OpenType vs. Omega 

The key features of the OpenType format are summarised in the list below. As 
each one of these features raises a particular compatibility issue with Ω, they
will all be elaborated below. 

1. Font and glyph metric information; 

2. Type 2 or TrueType glyph outlines (and hints or instructions); 

3. Advanced typographic features (mainly GSUB and GPOS); 

4. Clear distinction between character and glyph encodings; 

5. Pre-typesetting requirements; 

6. Extensible tabular file format. 



Font and Glyph Metrics. OpenType provides extensive metric information 
dispersed among various tables (post, kern, hmtx, hdmx, OS/2, VORG, etc.), both 
for horizontal and vertical typesetting. Although in most cases Ω’s and
OpenType’s metrics are interconvertible, a few important exceptions do exist (e.g.,
height/depth) where conversion is not straightforward. See [1,4].

Glyph Outlines, Hints and Instructions. Since the OpenType format it- 
self is generally not understood by PostScript printers, a conversion to more 
common formats like Type 1 or Type 42 is necessary. As explained in [1], to 
speed up this conversion process, we create Type 1 charstring collections using 
our own PFC tables which are used by odvips to create small, subsetted Type 1 
fonts (a.k.a. minifonts) on the fly. This solution, on the other hand, preserves
neither hints nor instructions, at least not in the present implementation. We
are therefore planning to also provide Type 42 support for TrueType-flavoured 
OpenType. This solution would allow us to preserve instructions, at the expense 
of subsetting and compatibility. 
Advanced Typographic Features. These are perhaps the most important
aspect of OpenType. Its GSUB (glyph substitution) and GPOS (glyph positioning)
tables are essential for typesetting many non-LGC scripts. In Ω, the equivalent
of GSUB features are the ΩTP’s: they can do everything GSUB features can,
including contextual operations. Glyph positioning is a different issue: since the
ΩTPs are designed for text rearrangement (substitutions, reordering etc.), they
are not as well suited to glyph placement. Context-dependent typesetting
microengines for character-level typesetting have been proposed for Ω to
provide modular, script- and language-specific positioning methods, along the
lines of ΩTP files; however, they have yet to be implemented. The positioning
features in OpenType GPOS tables are in fact the specifications for microengines.

Character and Glyph Encodings. The above discussion of advanced typographic
features brings us to a related issue: the fundamental difference between
Ω’s and OpenType’s way of describing them. Although both Ω and OpenType
are fully Unicode compatible, OpenType’s GSUB and GPOS features are based on
strings made of glyph id’s and not of Unicode characters. As for Ω and some of
its ΩTP’s, tasks such as contextual analysis or hyphenation are performed on
character sequences and the passage from characters to “real” glyph id’s happens
only when odvips replaces virtual fonts by real ones. To convert a glyph-based
OpenType feature into a character-based ΩTP would require Ω to offer means
of specifying new “characters” (the glyph id’s) that do not correspond to any
Unicode position. The conversion itself would not be difficult since Ω’s possible
character space is much larger than Unicode’s. This, however, would lead
us to glyph ID-based, hence font-specific, ΩTP’s and hyphenation, which is not
a lovely prospect, to say the least. To solve this problem, it will certainly be
necessary to keep both character and glyph information of the input text in
parallel during the whole typesetting and layout process. This dual representation
of text is also crucial for the searchability and modifiability of the output (PDF,
PS, SVG or any other) document.
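
The dual representation described above can be pictured as a run of text that carries its Unicode characters and the glyph ids chosen for them side by side. The following Python sketch is only an illustration under stated assumptions: the names `Cluster` and `DualRun`, the toy `cmap` and `ligatures` tables, and all glyph ids are hypothetical and belong neither to Ω nor to odvips nor to any real font.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One unit of text: the characters typed and the glyphs drawn for them."""
    chars: str
    glyph_ids: list


@dataclass
class DualRun:
    """Keeps character and glyph information in parallel (hypothetical)."""
    clusters: list = field(default_factory=list)

    @classmethod
    def from_text(cls, text, cmap):
        # Start with one glyph per character, looked up in a toy cmap.
        return cls([Cluster(ch, [cmap[ch]]) for ch in text])

    def apply_ligatures(self, ligatures):
        """Glyph-level substitution: merge two adjacent clusters whose glyph
        ids form a key of `ligatures`, while keeping their characters."""
        out = []
        i = 0
        while i < len(self.clusters):
            pair = self.clusters[i:i + 2]
            key = tuple(g for c in pair for g in c.glyph_ids)
            if len(pair) == 2 and key in ligatures:
                out.append(Cluster(pair[0].chars + pair[1].chars,
                                   [ligatures[key]]))
                i += 2
            else:
                out.append(self.clusters[i])
                i += 1
        self.clusters = out

    def text(self):
        return "".join(c.chars for c in self.clusters)

    def glyphs(self):
        return [g for c in self.clusters for g in c.glyph_ids]


cmap = {"f": 10, "i": 11, "n": 12}   # made-up glyph ids
ligatures = {(10, 11): 99}           # made-up "fi" ligature glyph
run = DualRun.from_text("fin", cmap)
run.apply_ligatures(ligatures)
print(run.glyphs())  # glyph side: font-specific
print(run.text())    # character side: still searchable
```

Because the substitution works on glyph ids but never discards the characters, hyphenation and searching can keep operating on `run.text()` while the output driver uses `run.glyphs()`.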

Pre-typesetting Requirements. OpenType relies on input text reordering
methods for its contextual lookups to work correctly. If Ω is to use the same
lookups, these reordering methods must also be implemented, either by ΩTP’s
or by an external library.

Extensibility. Finally, the OpenType format has the important feature of being
extensible: due to its tabular structure, new tables can be added into the font
file, containing, for example, data needed by Ω with no OpenType-equivalents
(like metrics or PFC charstrings, see below). However, it is necessary that the
given font’s license allow additions.
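
The tabular structure mentioned above is what makes private tables possible: an OpenType file starts with a directory of tagged tables, and a reader can fetch any table by its tag without understanding the others. The sketch below parses that directory with Python's `struct` module. It is a minimal illustration, not the authors' code: the toy font is assembled in memory, the tag `PFC ` for the private charstring table is an assumed spelling, and table checksums and 4-byte alignment, which real fonts require, are ignored.

```python
import struct


def read_table_directory(data):
    """Return {tag: (offset, length)} for every table in an sfnt font."""
    # Header: sfnt version (4 bytes), then numTables, searchRange,
    # entrySelector, rangeShift (2 bytes each), all big-endian.
    _version, num_tables = struct.unpack_from(">4sH", data, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record holds: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


def table_bytes(data, tag):
    """Return the raw bytes of one table, whether the reader knows it or not."""
    offset, length = read_table_directory(data)[tag]
    return data[offset:offset + length]


def build_toy_font(tables):
    """Assemble a directory plus payloads (checksums and padding omitted)."""
    header = struct.pack(">4sHHHH", b"OTTO", len(tables), 0, 0, 0)
    offset = 12 + 16 * len(tables)
    records, payloads = b"", b""
    for tag, payload in tables.items():
        records += struct.pack(">4sIII", tag.encode("latin-1"), 0,
                               offset, len(payload))
        payloads += payload
        offset += len(payload)
    return header + records + payloads


# "PFC " is a hypothetical tag for a private table added next to GSUB.
font = build_toy_font({"GSUB": b"\x00\x01", "PFC ": b"charstrings"})
print(sorted(read_table_directory(font)))
print(table_bytes(font, "PFC "))
```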

5.2 Solutions 

From the above discussion it should now be clear that complete and robust
OpenType support is not a simple patch to Ω and odvips. Three solutions are
proposed below, in order of increasing difficulty, which is also the order of our
working plan.

1. Convert OpenType fonts into existing Ω font metrics and ΩTP’s;

2. Provide built-in support within Ω for a fixed but extensive set of OpenType
features and read data directly from the OpenType font file;

3. Finally, provide extensible means for using the full power of OpenType fonts.

The Current Solution. This, described in detail in the EuroTeX article [1],
corresponds to the first solution. Here we give a short summary.

The initial solution was based on the approach that OpenType fonts should
be converted to Ω’s own formats, i.e., .ofm (metrics), .ovf (virtual fonts) and
ΩTP. Anish Mehta wrote several Python scripts to generate these files, of which
the most interesting is perhaps the one that converts the whole OpenType GSUB
table into ΩTP’s. Type 2 and TrueType outlines themselves are converted into
the Type 1-based PFC format and are subsetted on the fly by a modified odvips.

In summary, the present solution is a working one. Admittedly far from being
complete (GPOS support is missing, among others), it is intended to provide Ω
users with the possibility to typeset using OpenType fonts, including even some
of its advanced features, while further development is being done.

Future Solutions. The second and third solutions mentioned above require
that the Ω engine be capable of directly reading OpenType fonts, which can be
done using a public library such as freetype or Kenichi Handa’s libotf. This
would also eliminate the need to create .ofm and .ovf files.

Providing built-in support for a fixed set of features corresponds to the
aforementioned microtypesetting engines. For a given set of features, a new engine
can be written. This approach can be taken using standard OO techniques.

A more general approach requires the ability to reach into an OpenType font,
reading tables that were not known when the Ω engine was written. For this to
work, some kind of programming language is needed to manipulate
these new tables. A simple such language is Handa’s Font Layout Tables [2].

It should be clear that these solutions are not mutually exclusive and that 
backwards compatibility with the classic font system will be maintained. 

6 Conclusions 

At the time we are writing, this work is not completely finished. Nevertheless, it
is well advanced: the infrastructure is substantially cleaned up, and is extensible,
with clear API’s. Detailed documentation will be forthcoming on the Ω website.

If we view things in the longer term, we are clearly moving forward with two
related goals, the stabilisation of existing Ω infrastructure, and abandonment of
the TeX infrastructure for the design and implementation of a next-generation
open typesetting suite.

Such a suite should be a generic framework with an efficient C++ core, that
is universally extensible through a number of well-known scripting interfaces,
for example, Perl, Python, and Guile. Implementation of libraries similar to the
popular LaTeX suite could then be done directly in C++, on top of the core API,
or as a linked-in C++ stream filter.



Abstract. The multilingual support of LaTeX presents many weak
points, especially when a language does not present the same overall
syntactic scheme as English. Basque is one of the official languages in
the Basque Country, being spoken by almost 650,000 speakers (it is also
spoken in Navarre and the south of France). The origins of the Basque
language are unknown. It is not related to any neighboring language,
nor to other Indo-European languages (such as Latin or German). Thus,
dates, references and numbering do not follow the typical English pattern.
For example, the numbering of figure prefixes does not correspond
to the \figurename~\thefigure structure, but is exactly the other way
round. To make matters worse, the presence of declension can turn this
usually simple task into a nightmare. This article proposes an alternative
structure for the basic classes, in order to support multilingual documents
in a natural way, even in those cases where the languages do not follow
the typical English-like overall structure.



1 Introduction 

The origins of LaTeX are tied closely to the English language. Since those days,
however, it has spread to many different languages and different alphabets. The 
extent of the differences among these languages is not only related to lexical 
issues, but to the structure of the languages themselves. 

The main problem arises when the syntactic structure of the language does 
not follow the English patterns. In these cases the adoption of a new multilingual 
approach is required in order to produce documents for these languages. 



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 27-33, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

Although LaTeX is a highly parameterizable environment, it lacks resources
to alter the order of the parameters themselves. This is due to the fact that both
Germanic languages (such as English and German) and Romance languages
(such as French, Italian, Spanish) - and therefore the most widely spread
European research languages that use the Latin alphabet - share a similar word
order for numeric references. To make matters worse, the presence of declension
in structures such as dates and numbers leads to a complicated generalization of
procedures.

This paper describes an alternative structure for the basic classes, in order to 
support multilingual documents in a natural way, even in those cases where the 
languages do not follow the typical English-like overall structure. Specifically, the 
paper focuses on Basque, one of the official languages in the Basque Country,
being spoken by over half a million speakers (it is also spoken in Navarre and 
the south of France). 

The rest of the paper is organized as follows: section 2 describes the specific 
details of the Basque language, in section 3 a brief description of prior work is 
presented, section 4 describes the different approaches that can be followed to 
solve the problem, section 5 shows the advantages and drawbacks of the different 
solutions and finally, in section 6 some brief conclusions are presented. 



2 Specific Details of the Basque Language 

The origins of the Basque language are unknown. It is not related to any neigh- 
boring language, nor to other Indo-European languages (such as Latin or Ger- 
man). This is one of the reasons why word order and numbering schemes are 
different from those in English. 

Dates and numbers. Like many other languages, Basque uses declension instead
of prepositions. The main difference from other languages that use declension,
such as German, is that in Basque numbers are also fully declined, even in 
common structures such as dates. These declensions depend not only on the 
case, number and gender, but on the last sound of the word. Another peculiarity 
of Basque is the use of a base 20 numerical system instead of the traditional 
decimal one. 

This forces us to take into account not just the last figure of the number 
but the last two figures, in order to determine the correct declension for the 
number [3]. In the following example, two dates are represented using ISO 8601 
and its translation into Basque. 

2004-01-11 : 2004ko urtarrilaren 11n

2005-01-21 : 2005eko urtarrilaren 21ean

Note that although both days end in the same figure, the declension is slightly
different. The same happens to the year. The extra phonemes have been added
to avoid words that are difficult to pronounce. This makes automatic date
generation difficult, because it must take into account all the possible cases (as
base 20 is used, there may be as many as 20 different possibilities). The different
number endings are shown in table 1. Note that there are only twenty possible
terminations, and two declension classes are necessary.

Table 1. Endings.

Number   Ending (year)   Ending (day)
00       ko              -
01       eko             ean
02       ko              an
03       ko              an
04       ko              an
05       eko             ean
06       ko              an
07       ko              an
08       ko              an
09       ko              an
10       eko             ean
11       ko              n
12       ko              an
13       ko              an
14       ko              an
15       eko             ean
16       ko              an
17       ko              an
18       ko              an
19       ko              an
20       ko              an
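
The lookup implied by Table 1 can be sketched in a few lines of Python. This is an illustration only: the endings are transcribed from the table, but the way numbers above 20 are mapped onto a table row (reduction modulo 20, with multiples of 20 using row 20) is an assumption made here to cover cases such as day 21, and the month word is passed in already declined because the paper gives no month table.

```python
# Rows of Table 1 that take the longer ending ("eko" for years, "ean" for days).
LONG_ROWS = {1, 5, 10, 15}

YEAR_ENDINGS = {n: ("eko" if n in LONG_ROWS else "ko") for n in range(21)}
# Row 00 has no day ending in the table ("-"), so days start at row 01.
DAY_ENDINGS = {n: ("ean" if n in LONG_ROWS else "an") for n in range(1, 21)}
DAY_ENDINGS[11] = "n"


def table_row(number):
    """Map a number to a row of Table 1 using its last two figures.

    Assumption (not stated in the text): values above 20 are reduced
    modulo 20, and a non-zero multiple of 20 uses row 20.
    """
    last_two = number % 100
    if last_two <= 20:
        return last_two
    return last_two % 20 or 20


def basque_date(year, month_genitive, day):
    """Format a date; the month word is supplied already declined."""
    if not 1 <= day <= 31:
        raise ValueError("day out of range")
    return (f"{year}{YEAR_ENDINGS[table_row(year)]} "
            f"{month_genitive} {day}{DAY_ENDINGS[table_row(day)]}")


print(basque_date(2004, "urtarrilaren", 11))
print(basque_date(2005, "urtarrilaren", 21))
```

With these assumptions the two calls reproduce the two example dates given earlier in this section.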

Word order. When numbering a certain chapter, section, etc., in English-like 
languages the order is always the following: first, the item class (e.g. “figure”) is 
named and, afterwards, the number is written. For example, we have “Figure 1.1” 
or “Table 2.3” . However, this is not the case in Basque. In this language, we must 
reverse this order: “1.1 Irudia” or “2.3 Taula”. The same applies for chapters, 
sections and other kind of text partitioning structures. 

3 Related Work 

Multilingual support for LaTeX is traditionally performed using the Babel
package [2]. In this package, the overall structure of documents, such as books,
articles, etc., is fitted to different languages by using different variables for the
different strings in each language.

For example, we can take the way figure captions are numbered in these types
of documents: a variable called \figurename contains the string corresponding
to the word “figure” in the first part of the caption, while another variable,
\thefigure, contains the number assigned to that caption. When a new figure is
inserted in the document, the string preceding the caption is always formed by
using a concatenation of both variables. However, this process is not performed
by Babel, which would allow a general description of the language, but in the
different files that describe the document format: book.cls, article.cls, etc.
Thus, some of the work that should be performed by the module in charge of the
multilingual support is made by the formatting part of the typesetting software.

The file basque.ldf [1] currently provides support for Basque in Babel. In
this file, the most commonly used words have been translated. However, this does
not solve the problem of the different order of strings. In [1], a possible solution
is proposed using a new package for the document definition: instead of using 
the multilingual capabilities of Babel to solve the problem, a new document 
formatting file is defined, where the specific corrections for the language are 
performed. The limitation for multilingual document generation is obvious in this 
scheme: the format must be redefined whenever the language of the document 
is changed. Besides, a new definition for every single class of document must be 
performed for this particular language - as we are not philologists, we do not 
know if the same happens in other languages. 



4 Approaches to the Solution 

The solution to the problem described in this paper must deal with the following 
issues: 

— It must respect all the translations of the different strings generated auto- 
matically. 

— It must respect not only the translation, but the order of words as well. 

— The last problem to solve is the use of the \selectlanguage directive, which 
would allow us to change the hyphenation patterns and the automatic text 
generation structures dynamically in the same document. This directive is 
particularly useful for documents which contain the same text in different 
languages (e.g. user’s guides, where the manual has been translated). 

The main possible avenues to the solution are the following: 

— Use of specific classes for the language: This solution implies the redef- 
inition of every document format, in order to embed the corresponding word 
order alteration for automatic string generation. The main drawback of this 
alternative is the need for rewriting and adapting all the existing document 
formats. 

— Use of a specific package for the language: A second possibility could
include the definition of a new package for those languages that require
a word order alteration. This package should redefine the \fnum@figure
and the \fnum@table variables (among others, which define the chapter or
section name) in order to adapt them to the needs of the languages used. A
macro should be used to switch between the two modes.

— Inclusion of order parameters in the document class definition files:
This option requires that a new input parameter is defined in the document
class to define the order of the words. Basically, it is the same solution as
the first one, but merging all the different files for a document class into a
single (larger and more complex) file.

— Redefinition of existing multilingual support files: This solution
implies the addition of several lines to every language support file, where the
definition of the automatic strings such as the figure captions or the table
captions is performed. For example, for the case of table and figure captions,
the definitions for the Basque language would be the following:

\def\fnum@figure{\thefigure~\figurename}

\def\fnum@table{\thetable~\tablename}

These definitions should go into the basque.ldf file, immediately after the
definition of the terms for caption or table names. Thus, whenever a
\selectlanguage directive is introduced in the document, the Babel package
will read the definitions for the new language, which will include the
definitions for every string.
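
The idea behind the last alternative, that each language file carries the word order together with the translated strings, can be modelled outside LaTeX. The Python sketch below is only an analogy under stated assumptions: `LANGUAGES`, `Document` and the `caption_order` key are hypothetical names, and the Basque strings are the ones quoted in Section 2.

```python
# Per-language definitions, in the spirit of a Babel language file:
# each language supplies its strings *and* the order of name and number.
LANGUAGES = {
    "english": {"figurename": "Figure", "tablename": "Table",
                "caption_order": "{name} {number}"},
    "basque": {"figurename": "Irudia", "tablename": "Taula",
               "caption_order": "{number} {name}"},
}


class Document:
    def __init__(self, language="english"):
        self.select_language(language)

    def select_language(self, language):
        # Switching language reloads every definition, including the order,
        # which is what makes mid-document changes work.
        self.definitions = LANGUAGES[language]

    def _label(self, name_key, number):
        return self.definitions["caption_order"].format(
            name=self.definitions[name_key], number=number)

    def figure_label(self, number):
        return self._label("figurename", number)

    def table_label(self, number):
        return self._label("tablename", number)


doc = Document()
print(doc.figure_label("1.1"))
doc.select_language("basque")
print(doc.figure_label("1.1"))
print(doc.table_label("2.3"))
```

The document class (here `Document`) never mentions a particular language, so adding a language with a different order means adding one entry to `LANGUAGES` and nothing else.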

5 Comparison of Solutions 

We use the following criteria to compare the different solutions: 

— Extent of modification to existing files: This criterion measures how 
many existing files will be altered to fix the problem and how complicated 
this alteration is. 

— Addition of new files: This criterion measures how many new files are to
be added to the LaTeX distribution for each solution.

— The \selectlanguage issue: This criterion measures how well the solution
deals with possibly changing the language of the document dynamically.

— How easily new automatically-generated strings are included: In the
future, translation of new strings may be required. Therefore, the proposed
solution must provide an easy way to include these new strings.

5.1 Extent of Modification 

Here is how the solutions fare with respect to the first criterion: 

— Use of specific classes for the language: This option does not require 
that any file be modified, because new definitions are described in new files. 

— Use of specific package for the language: This approach requires no 
modifications of existing files, since all modifications are included in a new 
package. 

— Inclusion of order parameters in the document class definition
files: This alternative entails the redefinition of every document class. These
should admit a language parameter to determine the correct word order.

— Redefinition of existing multilingual support files: This choice implies
that every file containing the translation and definition of the
automatically-generated strings provides order information for them, and
therefore, all the files in Babel should be changed.

5.2 Addition of New Files

Here’s how the solutions fare with respect to adding new files:

— Use of specific classes for the language: This option requires all
document classes to be rewritten for every language that does not follow the
English-like structure.

— Use of specific package for the language: This approach requires one
new file for every language that has not been described successfully in the
Babel approach.

— Inclusion of order parameters in the document class definition files:
This alternative entails no new files, as it is based on the modification of the
existing files.

— Redefinition of existing multilingual support files: This choice does
not need new files, as it is based on the modification of the existing files.

5.3 The \selectlanguage Issue 

Depending on how generalization of the multilingual support is implemented,
the different solutions may or may not solve the \selectlanguage problem:

— Use of specific classes for the language: This option does not really
use Babel and its macros. As part of the translation of automatic strings is
performed by the file defining the format of the document class, support for
the \selectlanguage directive should be implemented in each document
class for every language (not only for those incorrectly supported by the
Babel system, but for all of them).

— Use of specific package for the language: This approach requires one
new file for every language. Hence, a macro would be required in each package
to leave things as they were before the package was initiated.

— Inclusion of order parameters in the document class definition files:
This alternative cannot solve the problem, because the order specification
is only made at the beginning of the document. A macro could be added
to alter its value dynamically throughout the document, but it would be an
artificial patch that would not fit naturally in the Babel structure.

— Redefinition of existing multilingual support files: This choice does
solve the problem, because when a new \selectlanguage command is
issued, the definitions for the new language are reloaded. It requires no new
macro definitions to suit the Babel scheme for multilingual documents.

5.4 Inclusion of New Strings 

Here’s how the solutions fare with respect to the possibility of including further
modifications for strings that could be necessary in the future:

— Use of specific classes for the language: As some of the linguistic
characteristics of the document are included in the document class, this option
does not provide a straightforward method for including changes for
problems that may arise.

— Use of specific package for the language: The use of a package gives
flexibility to the scheme, allowing the insertion of new macros to adapt to
the peculiarities of the language. However, the range of possibilities is so
wide that a very well-defined structure must be laid down in order to keep
a modicum of coherence for creating a document in a different language.

— Inclusion of order parameters in the document class definition files:
This scheme requires updating several files whenever a new string or scheme
must be added.

— Redefinition of existing multilingual support files: As this choice uses
a single file for every language, it makes updating the elements for Babel very
easy.

6 Conclusions 

This paper discusses some alternatives to solve the ordering problems that may 
arise in multilingual documents. 



Table 2. Solution comparison.

Solution         Mod.   Cr.    Multi.   Updates
Specific class   no     yes    Dif.     Dif.
Specific pack.   no     yes    Dif.     Dif.
Parameters       yes    no     Dif.     Dif.
Redefinition     yes    no     yes      yes



The characteristics of the different proposed solutions are summarized in 
table 2. Among the solutions, the most suitable would be the redefinition of all 
the existing Babel files. The reason is simple: it requires the addition of two lines 
to approximately 45 files, and allows the update of the system in the future, as 
it maintains all the translating issues within their natural context (Babel).



Abstract. This paper presents a successfully tested method for the
automatic conversion of monotonic modern Greek texts into polytonic
texts, applicable on any platform. The method consists of combining
various freely available technologies, which have much better results than
current commercially available solutions. The aim of this presentation is
to introduce a way of applying this method, in order to convert thousands
of digitally available single-accented modern Greek pages into attractive
artworks with multi-accented contents, which can be easily transferred
either to the Web or a TeX-friendly printer. We will discuss the preparatory
and postprocessing efforts, as well as the editing of syntax rulesets,
which determine the quality of the results. These rulesets are embedded
in extendable tables, functioning as flat databases.



1 Introduction 

During the past centuries, Greek and Hellenic scholars have introduced and 
refined polytonism (multiple accenting) in the written word for the precise 
pronunciation of ancient Greek. Since spoken modern Greek is comparatively
less complicated, the Greek government has officially replaced polytonism by 
monotonism (single accenting) for purposes of simplification, especially in the 
educational system. Also, Greek authors commonly use monotonism, since it is 
so much simpler to produce. 

Classical, polytonic Greek has three accents (acute, grave, and circumflex)
and two breathings (rough and smooth - equivalent to an initial ‘h’ and lack 
thereof). Accents are lexically marked, but can change based on other factors, 
such as clitics (small, unstressed words that lean on another word to form a 
prosodic word - a single word for accent placement). In addition, two other 
symbols were used: diaeresis (to indicate two vowels that are not a diphthong) 
and iota subscript (a small iota that was once part of a diphthong but subse- 
quently became silent). 

Monotonic Greek retains only the acute accent, which was usually, though
not always, the same as the classical acute. To make a graphic break with the
past, the new acute accent was written as a new tonos glyph, a dot or a nearly
vertical wedge, although this was officially replaced by a regular acute in 1986.

So, why bother with the complexities of polytonism? The benefits are
increased manuscript readability and, even more important, reduced ambiguity.
Despite the simplification efforts and mandates, the trend nowadays is back to the
roots, namely to polytonism. More and more publishers appreciate, in addition
to the content, the public impression of the quality of the printed work.

This paper discusses an innovative and flexible solution to polytonism with 
an open architecture, enabling the automatic multiple accenting of existing 
monotonic Greek digital documents. 

2 Terminology 

In this article, we will use the terms polytonism and multiple accenting inter- 
changeably to mean the extensive usage of spiritus lenis, spiritus asper, iota 
subscript, acute, gravis and circumflex. Similarly, we use the terms monotonism 
and single accenting to mean the usage of simplified accenting rules in Modern 
Greek documents. 

3 Historic Linguistic Development 

During the last four decades the printed Greek word has undergone both minor
and radical changes. Elementary school texts during the late 1960s and early
1970s used Purified Greek (καθαρεύουσα), by strict government law of the time.
The mid-1970s saw a chaotic transition period from Purified Greek to Modern
Greek (δημοτική) with simplified grammar, where some publications were printed
with multiple accenting, some with single accenting and even some without any
accenting at all!

Even after the government officially settled on monotonism in the early 1980s, 
Greek publishers were not able to switch immediately to the monotonic system. 
During the last decade, many computerized solutions have been invented for 
assistance in typing monotonic Greek. Today, there is a trend toward a mixture 
of simplified grammar with multiple accenting, decorated with Ancient Greek 
phrases. See Table 3. 

4 Polytonic Tools 

There are two programs for Microsoft Word users, namely TONISMOS by
DATA-SOFT and ΑΥΤΟΜΑΤΟΣ ΠΟΛΥΤΟΝΙΣΤΗΣ (academic and professional
version) by MATZENTA. A third is the experimental μονο2πολυ, an
open source project, which is the subject of this discussion.

These solutions are independent. The major difference between the commercial
and open source programs is the control of the intelligence, such as logic,
rule sets and integrated databases. In the case of the commercial solutions, users
depend on the software houses; in the open source case, users depend on their
own abilities. See Table 2.

There is no absolutely perfect tool for polytonism, so the ultimate choice is
of course up to users themselves.

5 Open Source Concept

μονο2πολυ implements a modular mechanism for multiple accenting of
single-accented Greek documents. See Figure 1.

5.1 Architecture 

The μονο2πολυ architecture consists of (Figure 2):

— methods: DocReader, DocWriter, DBParser, Converter

— configuration file: *.cfg

— flat database: *.xml

— document type definition: *.dtd

— optional spreadsheet: *.csv, *.xls

[Table: comparison of the polytonic tools; only the two rows below are recoverable.]

Availability    immediately         immediately         under development
Distribution    purchased license   purchased license   open source

Fig. 1. Overview of the overall multiple accenting concept, which involves many
external tools.

Fig. 2. Overview of internal architecture, with external interfaces to existing standards.

Fig. 3. External creation of monotonic Greek with any word processor (e.g.,
OpenOffice) using a separate spellchecker (e.g., elspell).



5.2 Configuration 

The plain text configuration file defines (gray arrows in Figure 2) the necessary
filenames and pathnames. The μονο2πολυ components read this during
initialization to determine where to find the input files and where to write the output
files (by default, the current working directory).

5.3 Database Connectivity

The dotted arrows in the architecture figure (Figure 2) show the connection
between a CSV spreadsheet, a Document Type Definition (DTD), the actual
XML flat database, and the database parser.

5.4 Input Prerequisites 

During the conversion process, invisible special control codes for formatting 
features (superscript, bold, etc.) make it difficult to coherently step through 
paragraphs, sentences and words. Therefore, plain text files serve best for poly- 
tonism input. 

The DocReader component of μονο2πολυ expects the source document to be
in the ISO 8859-7 encoding, and to be written according to the Modern Greek
grammar, especially regarding the monotonic accenting rules.
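
What the encoding requirement means in practice can be shown with a few lines of Python (used here for illustration only; the tool itself is written in Java). The function names are hypothetical, the sample phrase is a monotonic rendering of the text shown in Figure 5, and the choice of UTF-8 as the output form of ISO 10646 is an assumption.

```python
def read_monotonic(raw_bytes):
    """Decode ISO 8859-7 input into a Unicode string."""
    return raw_bytes.decode("iso8859_7")


def write_unicode(text):
    """Encode a string as UTF-8, one encoding form of ISO 10646 (assumed)."""
    return text.encode("utf-8")


# Stands in for the contents of a monotonic source file.
source = "το ένα ή το άλλο".encode("iso8859_7")
text = read_monotonic(source)

print(len(source), len(text))      # ISO 8859-7 uses one byte per character
print(len(write_unicode(text)))    # Greek letters need two bytes in UTF-8
```

ISO 8859-7 has room for the monotonic accented vowels but not for the polytonic combinations, which is why the output side has to switch to ISO 10646.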

Assistance for monotonic accenting while typing Modern Greek documents is 
provided by Microsoft’s commercially bundled spellchecker, or any downloadable 
Open Source spellchecker. 



5.5 Converter 

The bold arrows in the architecture figure (Figure 2) show the data exchange
between the converter and the other internal components: the document reader,
the database parser and the document writer. The conversion process does not
include grammar analysis, since μονο2πολυ expects that monotonic proof reading
has been done previously, with other tools.
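
A conversion that relies on table lookup rather than grammar analysis can be sketched as follows. This is a toy illustration in Python, not the Converter component itself: the `RULES` dictionary and the `convert` function are hypothetical, and the four word pairs are taken from the sample phrase of Figure 5.

```python
import re

# Toy ruleset: monotonic word -> polytonic word. A real ruleset would be
# loaded from the flat database instead of being written inline.
RULES = {
    "το": "τὸ",
    "ένα": "ἕνα",
    "ή": "ἢ",
    "άλλο": "ἄλλο",
}


def convert(text, rules):
    """Replace every known word; leave unknown words and punctuation alone."""
    def lookup(match):
        word = match.group(0)
        return rules.get(word, word)
    # In Python 3, \w+ matches runs of letters, Greek ones included.
    return re.sub(r"\w+", lookup, text)


print(convert("το ένα ή το άλλο...", RULES))
```

A plain word list like this cannot choose between forms that depend on the neighbouring words, which is presumably where the syntax rulesets mentioned in the abstract come in.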



6 External Interfaces 

The output from μονο2πολυ (DocWriter method) is in the ISO 10646-1
encoding, in various formats, which are then post-processed. The dashed arrows in
Figure 2 show the relationship between the external files.

6.1 Web Usage 

For background, these web pages discuss polytonic Greek text and Unicode¹
(UTF) fonts:

— http://www.ellopos.net/elpenor/lessons/lesson2.asp

— http://www.stoa.org/unicode/

— http://www.mythfolklore.net/aesopica/aphthonius/1.htm

¹ Concerning missing Greek characters and other linguistic limitations in Unicode,
see Guidelines and Suggested Amendments to the Greek Unicode Tables by Yannis
Haralambous at the 21st International Unicode Conference in May 2002 in Dublin,
Ireland (http://omega.enstb.org/yannis/pdf/amendments2.pdf).

Fig. 4. Example of original monotonic input.

Using the direct HTML polytonic output from μονο2πολυ requires that the
layout of the web page be done in advance, since manually editing the numeric
Unicode codes in the *.html file is impractical (see figure 5). Dynamic web pages
created through CGI scripts, PHP, etc. have not yet been tested.

<font
FACE="Arial Unicode MS"
SIZE="36">
&#7978;&#32;
&#964;&#8056;&#32;
&#7957;&#957;&#945;&#32;
&#7970;&#32;
&#964;&#8056;&#32;
&#7940;&#955;&#955;&#959;
&#46;&#46;&#46;
</font>

Fig. 5. Example of polytonic HTML output. 
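
The numeric codes in this kind of output are simply the decimal Unicode code points of the characters. As an illustration (in Python, not the tool's own Java, and with a hypothetical function name), the encoding and its inverse can be written as:

```python
import html


def to_numeric_references(text):
    """Write every character, spaces included, as a decimal reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


encoded = to_numeric_references("τὸ ἄλλο")
print(encoded)
print(html.unescape(encoded))   # decoding gives the readable text back
```

Decoding with `html.unescape` is a convenient way to proofread such a file without editing the codes by hand.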



6.2 OpenOffice Usage

The ISO 10646-1 encoded polytonic output (Figure 6) from μονο2πολυ could
be inserted into the OpenOffice Writer software, since the newest version can
directly output polytonic Greek .pdf files. Unfortunately, the quality of the
result leaves much to be desired. Better results can be produced by converting
from Writer to LaTeX and doing further processing in the LaTeX environment.

Fig. 6. Example of polytonic output.



6.3 (La)TeX Usage 

The most likely scenario for (La)TeX users is using the Greek babel package, and 
adding the μονο2πολυ 7-bit polytonic output text into the source .tex file. See 
Figures 7 and 8. 

The 7-bit output from μονο2πολυ could presumably also be inserted into 
.fo files, and processed through PassiveTeX, but this has not yet been tested. 
Likewise, the ISO 10646-1 output could presumably be processed directly with 
Ω/Λ, but this has not been tested, either. 




Fig. 7. Example of polytonic TeX output, either from μονο2πολυ or Writer2LaTeX. 

Fig. 8. Polytonic PDF output from LaTeX. 



7 Technologies Used in μονο2πολυ 

After some evaluation, we chose to focus on Java, Unicode and XML, due to 
their flexibility in processing non-Latin strings, obviously a critical requirement 
of μονο2πολυ. 



7.1 Programming Language 

Two major reasons for choosing Java (J2SE) as the implementation language 
of μονο2πολυ were the capabilities for handling XML and Unicode through 
widely-available and well-documented libraries. The Java SDK provides 
extremely useful internationalization features, with the ability to easily manipulate 
string values and files containing wide characters. 

In order to concentrate on μονο2πολυ's essential features, no graphical user 
interface has been designed. 
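
As a small illustration of the string and encoding handling mentioned above, the following sketch decodes ISO 8859-7 bytes and re-encodes them as big-endian UTF-16. It is not taken from μονο2πολυ; the class and method names are hypothetical, and it assumes that the running JDK provides the ISO-8859-7 charset, which is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodeDemo {
    /** Decodes ISO 8859-7 bytes and re-encodes the text as big-endian UTF-16. */
    static byte[] greekToUtf16(byte[] iso88597) {
        String text = new String(iso88597, Charset.forName("ISO-8859-7"));
        return text.getBytes(StandardCharsets.UTF_16BE);   // no byte order mark is written
    }

    public static void main(String[] args) {
        // The monotonic bytes for kappa, alpha, iota
        byte[] mono = {(byte) 0xEA, (byte) 0xE1, (byte) 0xE9};
        for (byte b : greekToUtf16(mono)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```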



7.2 Character Set 

The choice of Unicode/ISO 10646-1 for the character set should be clear. 
It combines monotonic and polytonic Greek letters, is known worldwide and 
standardized on most platforms, and contains most (though not all) variations 
of Greek vowels and consonants, in the Greek and the Greek Extended tables 
(see http://www.unicode.org/versions/Unicode4.0.0/ch07.pdf). 

For further information on writing polytonic Greek text using Unicode, see 
http://www.stoa.org/unicode/. 



7.3 Text Parsing Libraries 

Most helpful for the parsing of XML-based database entries are the SAX and 
DOM Java libraries. 

The following Java source code, taken from the μονο2πολυ class DBparse, 
serves to demonstrate usage of SAX and DOM. The code counts and then 
outputs the total number of entries in the XML database file. 


import java.io.*; 
import org.xml.sax.SAXException; 
import org.xml.sax.SAXParseException; 
import javax.xml.parsers.DocumentBuilder; 
import javax.xml.parsers.DocumentBuilderFactory; 
import javax.xml.parsers.FactoryConfigurationError; 
import javax.xml.parsers.ParserConfigurationException; 
import org.w3c.dom.*; 

public class DBparse { 
    static Document document; 
    // static, so that it can be used from main() 
    static String warn = "No XML database filename given..."; 

    public static void main(String param[]) { 
        if (param.length != 1) { 
            System.out.println(warn); 
            System.exit(1); 
        } 
        File mydbfile = new File(param[0]); 
        boolean load = mydbfile.canRead(); 
        if (load) { 
            try { 
                DocumentBuilderFactory fct 
                    = DocumentBuilderFactory.newInstance(); 
                DocumentBuilder builder 
                    = fct.newDocumentBuilder(); 
                document = builder.parse(mydbfile); 
            } catch (SAXParseException error) { 
                System.out.println("\nParse error at line: " 
                    + error.getLineNumber() + " in file: " 
                    + error.getSystemId()); 
                System.out.println("\n" + error.getMessage()); 
            } catch (ParserConfigurationException pce) { 
                pce.printStackTrace(); 
            } catch (IOException ioe) { 
                ioe.printStackTrace(); 
            } catch (Throwable t) { 
                t.printStackTrace(); 
            } 
        } else { 
            System.out.println("XML database missing!"); 
        } 
        if (document == null) { 
            System.exit(1);   // nothing was parsed, so nothing to count 
        } 
        // the element name of a database entry, used as the search string 
        String mytag = "\u03C3"; 
        NodeList taglist = document.getElementsByTagName(mytag); 
        int amount = taglist.getLength(); 
        System.out.println("amount of entries:\n" + amount); 
    } 
} 

Notice particularly the line near the end where mytag is assigned \u03C3, namely 
the character σ, used as the search string. 
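
The listing builds a DOM tree and only uses the SAX exception classes. For comparison, a purely SAX-based count might look like the following sketch, which counts start tags as they stream past instead of holding the whole database in memory. It is not part of μονο2πολυ; the class name and the two-entry test document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxCount {
    /** Counts the elements named tag in the given XML text with a SAX handler. */
    static int count(String xml, String tag) throws Exception {
        final int[] counter = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    counter[0]++;          // one more entry seen
                }
            }
        });
        return counter[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical database: a root element (U+03B2) with two entries (U+03C3)
        String xml = "<\u03B2><\u03C3/><\u03C3/></\u03B2>";
        System.out.println("amount of entries: " + count(xml, "\u03C3"));
    }
}
```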



8 Database Structure 

The XML standard from the W3C has proven to be a simpler choice for 
storing either monotonic or polytonic Unicode text than the alternatives, such 
as spreadsheets or even SQL databases. The quality of the final polytonic result 
depends on the precision of the XML content, where ambiguities have to be 
marked with special symbols for manual post-processing. 

Currently, the entries of the basic database consist of tags with parameters 
and values. The tag name indicates the type of the expression: a single character, 
a prefix, a suffix, a substring, a word or a chain of words. The five parameters 
are as follows: 

1. The monotonic ISO 8859-7 encoded source expression to be converted. 

2. The Unicode output text. 

3. A 7-bit output text for (La)TeX usage with the Greek babel package. 

4. The equivalent numeric value according to the Extended Greek Unicode table 
for HTML usage. 

5. An explanatory comment or example, in case of ambiguities or linguistic 
conflicts. 

Here, I have built on the work of prior Greek TeX packages, such as GreeKTeX 
(K. Dryllerakis), ScholarTeX (Y. Haralambous), and greektex (Y. Moschovakis 
and G. Spiliotis), for techniques of using the iota subscript, breathings and 
accents in 7-bit transliterated .tex source files. 

In the following examples, note carefully the different bases used: '074 is 
octal, &#8172; is decimal and 03D1 is hexadecimal. 
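
The three notations can be checked against each other with a few lines of Java. This is only an illustrative sketch with a hypothetical class name. It shows that octal 074 is the ASCII character "<" (which the Greek babel transliteration uses for the rough breathing), that decimal 8172 is U+1FEC (capital rho with rough breathing), and that 03D1 is already a hexadecimal code point.

```java
class Bases {
    /** Parses digits in the given radix and formats the value as U+XXXX. */
    static String codePoint(String digits, int radix) {
        return String.format("U+%04X", Integer.parseInt(digits, radix));
    }

    public static void main(String[] args) {
        System.out.println(codePoint("074", 8));    // octal, as in \char'074
        System.out.println(codePoint("8172", 10));  // decimal, as in an HTML reference
        System.out.println(codePoint("03D1", 16));  // hexadecimal, as in Unicode tables
    }
}
```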



8.1 Document Type Definition 

The required basic Document Type Definition (DTD) is currently located in the 
experimental namespace xmlns:β = http://koti.welho.com/ilikos/TeX/LaTeX/mono2poly/mono2poly.dtd. 
It contains the following information: 

<!ELEMENT β (σ+)> 
<!ELEMENT σ (#PCDATA)> 
<!ATTLIST σ 
    μ CDATA #REQUIRED 
    π CDATA #REQUIRED 
    τ CDATA #REQUIRED 
    δ CDATA #REQUIRED 
    ξ CDATA #REQUIRED> 

Thus, we have one element, called β (βάση = database). It contains 
multiple element sets, called σ (στοιχεῖα συλλαβῆς = syllable data). Each 
element set has, at present, five attributes, namely μ for monotonic expressions, 
π for polytonic expressions, τ for 7-bit (La)TeX code, δ for HTML code, and 
finally ξ for comments. 

The DTD can be overridden by a local .dtd file, which must be specified in 
the header of the .xml database file; for example: 

<!DOCTYPE β SYSTEM "my_own_mono2poly.dtd"> 

Both the .dtd and the .xml file must reside in the same directory. 



8.2 Data Entries 

Here is an example database entry, showing the only Greek capital consonant 
with spiritus asper: 

<σ 
μ="Ρ" 
π="Ῥ" 
τ="\char'074 R" 
δ="&#8172;" 
ξ="Ῥόδος" 
></σ> 

The slash symbol indicates the closing element tags in XML, while the backslash 
symbol is used for (La)TeX commands. Both appear in the .xml database 
file. 



8.3 Header and Body 

Although not explicitly documented, exotic characters may be used in .dtd and 
.xml files as long as the appropriate encoding is declared: 

<?xml version="1.0" encoding="UTF-16"?> 

The header should include other information as well. Schematically: 

<?xml version="1.0" encoding="UTF-16"?> 
<!DOCTYPE β SYSTEM "mono2poly.dtd"> 
<!-- author: ... --> 
<!-- affiliation: ... --> 
<!-- creation date: ... --> 
<!-- notes: ... --> 
<β> 
<σ μ="..." π=".." τ=".." δ=".." ξ=".."></σ> 
<σ μ="..." π=".." τ=".." δ=".." ξ=".."></σ> 
</β> 

For quality assurance, after database creation and after each update a 
validation and verification test should be run, to detect XML syntax errors and 
linguistic content mistakes. 
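
The syntax half of such a test can be automated with the validating parser that ships with the JDK. The following is a minimal sketch, not the author's test procedure. The class name is hypothetical, and the sample documents use a simplified inline DTD with ASCII names (db, e, m) so that the example is self-contained. Linguistic mistakes are of course beyond what a DTD can detect.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidateDb {
    /** Returns true if the XML text is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory fct = DocumentBuilderFactory.newInstance();
            fct.setValidating(true);                  // switch on DTD validation
            DocumentBuilder builder = fct.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e;                          // treat validity errors as fatal
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE db [<!ELEMENT db (e+)><!ELEMENT e (#PCDATA)>"
                   + "<!ATTLIST e m CDATA #REQUIRED>]>";
        System.out.println(isValid(dtd + "<db><e m=\"x\"></e></db>"));  // complete entry
        System.out.println(isValid(dtd + "<db><e></e></db>"));          // attribute missing
    }
}
```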

This concept of the database as a lookup/mapping table allows differentiating 
between initial and intermediate consonants. For example: 

ϐ (03D0) → β (03B2) 
ϑ (03D1) → θ (03B8) 
ϱ (03F1) → ρ (03C1) 
ϕ (03D5) → φ (03C6) 

Therefore, by updating the XML file, post-processing may be reduced. 
Experienced linguists may wish to use different tools for the correcting and the 
updating of the flat database. Rows with multiple columns from spreadsheets 
can be inserted directly into XML data files, as long as the columns are sorted 
in the expected order. 
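
The effect of such position-dependent entries can be sketched in a few lines of Java. The sketch assumes, purely for illustration, that the variant letter forms (U+03D0, U+03D1, U+03F1, U+03D5) are wanted in non-initial position and the ordinary forms at the start of a word. The actual choice is made by the database entries, and the class and method names here are hypothetical.

```java
class MedialForms {
    /** Replaces beta, theta, rho and phi by their variant forms when not word-initial. */
    static String useMedialForms(String text) {
        String plain = "\u03B2\u03B8\u03C1\u03C6";     // beta, theta, rho, phi
        String variant = "\u03D0\u03D1\u03F1\u03D5";   // the corresponding variant forms
        StringBuilder out = new StringBuilder();
        boolean wordStart = true;
        for (char c : text.toCharArray()) {
            int i = plain.indexOf(c);
            out.append(!wordStart && i >= 0 ? variant.charAt(i) : c);
            wordStart = !Character.isLetter(c);        // the next letter starts a word
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "biblio": the first beta is kept, the second one is replaced
        System.out.println(useMedialForms("\u03B2\u03B9\u03B2\u03BB\u03AF\u03BF"));
    }
}
```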

8.4 Expression Types 

In each database entry, there is one source expression, at least three target 
expressions, and possibly one explanation. The ISO 8859-7 encoded source 
expression and the first ISO 10646-1 encoded target expression may be a: 

— single uppercase or lowercase character with or without spiritus and/or 
accent 

— partial word, such as prefix, intermediate syllable, suffix 

— complete word 

— chain of combined words 

— combination of partial word pairs, such as a suffix followed by a prefix 

— mixture of complete and partial words, such as a complete word followed by 
a prefix or a suffix, followed by a complete word 

The rest of the target expressions represent the same information as the 
first in other output formats, namely for 7-bit Greek (La)TeX and for HTML. 
The intelligence of the μονο2πολυ system currently lies in the database, 
so while creating and editing entries, it is crucial to write them correctly. 

8.5 Editing Tools 

One of the most powerful Unicode editors is the Java-based Simredo 3.x by 
Cleve Lendon, which has a configurable keyboard layout, and is thus suitable 
for this sort of task. The latest version of Simredo, 3.4 at this writing, can be 
downloaded from http://www4.vc-net.ne.jp/~klivo/sim/simeng.htm, and 
installed on any platform supporting JDK 1.4.1 from Sun. Simredo can be 
started by typing java Simredo3 or perhaps java -jar Simredo3.jar in the 
shell window (Linux) or in the command window (Windows). Unicode/XML 
with Simredo has been successfully tested on Windows XP and on SuSE Linux 
8.1 Professional Edition. 

Fig. 9. Another useful tool is a character mapping table like this accessory on Windows 
XP, which displays the shape and the 16-bit big endian hexadecimal code of the selected 
character. 

The author would be happy to assist in the preparation of a polytonic 
Greek keymap file (.kmp) for Simredo, but the manual may prove sufficient. Such 
a keymap file is created by writing one line for each 
key sequence definition. For instance, given the sequence 2keys ; AἍ using the 
desired Unicode character, or the equivalent sequence 2keys ; A\u1F0D with the 
big endian hexadecimal value, one can produce an uppercase alpha with spiritus 
asper and acute accent by pressing the ; and A keys simultaneously. According 
to the Simredo manual, other auxiliary keys such as Alt can be combined with 
vowel keys, but not Ctrl. 

Some other Unicode editors: 

— For Windows: http://www.alanwood.net/unicode/utilities_editors.html. 

— For Linux: http://www.unicodecharacter.com/unicode/editors.html. 

— For Mac OS: http://free.abracode.com/sue/. 

Unfortunately, XMLwriter and friends neither support configurable keyboard 
layouts nor display 16-bit Unicode. 



8.6 Polytonic Keyboard Drivers 

Instant interactive multi-accenting while editing Greek documents is available 
either through plug-ins for some Windows applications, such as SC UniPad 
(http://www.unipad.org/main/) and Antioch 
(http://www.users.dircon.co.uk/~hancock/antioch.htm), or with the help of 
editable keyboard mapping tables, such as the Simredo Java program described 
above. Regrettably, the Hellenic Linux User Group (HEL.L.U.G., 
http://www.hellug.gr and http://www.linux.gr) has no recommendations for 
polytonic Greek keyboard support. 

Whatever polytonic keyboard driver has been installed and activated may be 
useful for new documents, but does not much help the author who is not familiar 
with the complicated rules of polytonism! 



8.7 Auxiliary Tables 

Preparation and periodic updates of auxiliary tables can of course be done with 
any software supporting Unicode. Spreadsheets have the advantage of putting 
the entries into cells row-by-row and thus organizing the parameters by column. 
This may prove easier than directly writing the XML file. See Figure 10. 

A row in such a .csv file looks like this: 

"Ρ","Ῥ","\char'074 R","&#8172;","Ῥόδος" 

Of course it then must be re-shaped with element and attribute tags to make an 
XML-syntax database entry. 
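
This re-shaping step is mechanical and could be scripted. The following sketch is one possible way to do it, assuming that every field is enclosed in double quotes and that no field contains an embedded double quote. The class name is hypothetical. The element and attribute names are passed in as arguments, and the ASCII names used in main are placeholders for the one-letter Greek names of the DTD. XML special characters in the fields are escaped, so an HTML reference such as &#8172; is stored as literal text.

```java
import java.util.ArrayList;
import java.util.List;

class CsvToEntry {
    /** Splits a row whose fields are all double-quoted and comma-separated. */
    static List<String> fields(String row) {
        String trimmed = row.trim();
        // Drop the outermost quotes, then split on the quote-comma-quote separators.
        String inner = trimmed.substring(1, trimmed.length() - 1);
        List<String> out = new ArrayList<>();
        for (String f : inner.split("\"\\s*,\\s*\"", -1)) {
            out.add(f);
        }
        return out;
    }

    /** Escapes the characters that may not appear literally in an attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    /** Builds one XML entry from a CSV row, one attribute per column. */
    static String toEntry(String row, String element, String[] attrs) {
        List<String> f = fields(row);
        if (f.size() != attrs.length) {
            throw new IllegalArgumentException("expected " + attrs.length + " fields");
        }
        StringBuilder sb = new StringBuilder("<").append(element);
        for (int i = 0; i < attrs.length; i++) {
            sb.append(' ').append(attrs[i]).append("=\"")
              .append(escape(f.get(i))).append('"');
        }
        return sb.append("></").append(element).append('>').toString();
    }

    public static void main(String[] args) {
        String row = "\"\u03A1\",\"\u1FEC\",\"\\char'074 R\",\"&#8172;\","
                   + "\"\u1FEC\u03CC\u03B4\u03BF\u03C2\"";
        String[] attrs = {"mono", "poly", "tex", "html", "note"};  // placeholder names
        System.out.println(toEntry(row, "entry", attrs));
    }
}
```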
[Screenshot: the spreadsheet's "Export of text files" dialog, with settings for 
character set, field delimiter and text delimiter.] 

Fig. 10. Using a spreadsheet to produce a long extendable list with five columns, which 
then can be saved as a .csv file. Be careful with the parametrization! 
[Screenshot: browser preferences, "Fonts and colors" page, with the "International 
fonts" dialog open; writing system Greek Extended, normal font Palatino Linotype.] 

Fig. 11. Choosing the Unicode font for viewing in a browser. 



8.8 Viewing Tools 

Users without any programming knowledge may find it useful to open and inspect 
the header and the body of the XML database before using it in polytonic 
documents. Here is a procedure for doing that. 

First, set the Unicode font in the preferences of the desired browser (Fig- 
ure 11). These days, most browsers support this, including Internet Explorer, 
Konqueror, Netscape Navigator and Opera. 

Then, select Unicode UTF-16 as the default encoding (Figure 12). The 
browser can now detect syntax errors, giving immediate feedback (Figure 13). 

8.9 Prioritization of Database Entries 

Polytonic exceptions (e.g., οὔτε and ὥστε without circumflex) and especially 
ambiguities (e.g., που → ποὺ or ποῦ, πού → ποῦ; πως → πὼς or πῶς, πώς → πῶς) 
have the highest priority in the database, then the special expressions, while 
the simple, casual and obvious accented syllables or particles have the lowest 
priority. In order to avoid mis-accented and mis-spirited syllables as much as 
possible, entries must be in the appropriate order. 
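
Why the order matters can be seen in a small sketch of an ordered lookup, where the first matching entry wins at each text position. This is an assumed model of the conversion, not the actual μονο2πολυ algorithm. The class name is hypothetical, and the second entry in main (accented omega to omega with circumflex) is an invented general rule, used only to show an exception being applied first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PriorityLookup {
    /** At each position, applies the first entry (in insertion order) whose key matches. */
    static String convert(String text, Map<String, String> entries) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            boolean matched = false;
            for (Map.Entry<String, String> e : entries.entrySet()) {
                if (!e.getKey().isEmpty() && text.startsWith(e.getKey(), pos)) {
                    out.append(e.getValue());
                    pos += e.getKey().length();
                    matched = true;
                    break;                       // later entries have lower priority
                }
            }
            if (!matched) {
                out.append(text.charAt(pos++));  // no entry applies: copy unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();   // keeps insertion order
        // Exception first: the whole word, with rough breathing and acute (U+1F65)
        db.put("\u03CE\u03C3\u03C4\u03B5", "\u1F65\u03C3\u03C4\u03B5");
        // Invented general rule second: accented omega becomes U+1FF6
        db.put("\u03CE", "\u1FF6");
        System.out.println(convert("\u03CE\u03C3\u03C4\u03B5", db));
    }
}
```

With the two entries in the opposite order, the general rule would fire first and the exception would never be reached.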

For example, Table 3 shows lexical rules defining eight variations of the Greek 
interrogative pronoun τι (= “which”) as a single monotonic expression: 

— with and without neutral accent 

— with and without Greek question mark 

— standalone 

— leading word in the sentence 

— intermediate word in the sentence 

— trailing word in the sentence 

[Screenshot: browser View menu with the Encoding submenu open; under Unicode, 
UTF-16 is selected.] 

Fig. 12. Selecting UTF-16 for the default encoding. 

Fig. 13. Example error message from browser. 

Database entries like these are needed to account for the variations shown in 
Table 4. As a rough analogy in English, it is as if Table 3 shows variations on 
“I”: Initial position (“I went to the store”); after a verb (“What do I know?”); 
etc., and then Table 4 shows that “I” isn’t always capitalized: “It looks good” 
vs. “Can you see it?” 

The above does not cover all cases related to τι. The monotonic text γιατί 
may be accented in two different ways in polytonic Greek (namely, γιατὶ and 
γιατί); additional entries would be required to handle this. 

Table 3. Eight variations of τι as a monotonic expression. 

<!-- ἐρωτηματικὲς ἀντωνυμίες --> 
<σ μ="Τι;" π="Τί;" τ="T'i;" δ="Τί;" ξ="Τί;"> </σ> 
<σ μ="Τι " π="Τί " τ="T'i " δ="Τί " ξ="Τί λές;"> </σ> 
<σ μ=" τι;" π=" τί;" τ=" t'i;" δ=" τί;" ξ="Κι ἔγραφε τί;"> </σ> 
<σ μ=" τι " π=" τί " τ=" t'i " δ=" τί " ξ="Καὶ τί ἔγραφε;"> </σ> 
<σ μ="Τί;" π="Τί;" τ="T'i;" δ="Τί;" ξ="Τί;"> </σ> 
<σ μ="Τί " π="Τί " τ="T'i " δ="Τί " ξ="Τί λές;"> </σ> 
<σ μ=" τί;" π=" τί;" τ=" t'i;" δ=" τί;" ξ="Κι ἔγραφε τί;"> </σ> 
<σ μ=" τί " π=" τί " τ=" t'i " δ=" τί " ξ="Καὶ τί ἔγραφε;"> </σ> 



Table 4. Similar but unrelated syllables treated differently. 

Position       Syllable    Polytonic examples 
leading        ΤΙ-, Τι-    Τιμή, Τισσαφέρνης 
intermediate   -τι-        ἀδυνάτισμα 
trailing       -τι         χάλι, πράγματι 
leading        Τί-, τί-    τίποτα, τίγρης 
intermediate   -τί-        ἐκτίμηση 
trailing       -τί         ψωμί, τυρί 



9 Polytonic Printing Press 

The author has found several Greek newspapers, magazines, and books, 
including university presses, using polytonic Greek: 


— daily newspapers: 

• Η ΚΑΘΗΜΕΡΙΝΗ 

• ΕΣΤΙΑ 

— monthly magazines: 

• ΝΕΜΕϹΙΣ 

— book publishers: 

• ΕΚΔΟΤΙΚΗ ΑΘΗΝΩΝ 

• ΚΥΡΙΑΚΙΔΗΣ 

• ΓΕΩΡΓΙΑΔΗΣ 

• ΠΑΠΑΝΙΚΟΛΑΟΥ 

• ΙΝΔΙΚΤΟΣ 

• ΓΝΩΣΗ 

• ΚΑΛΟΦΩΛΙΑΣ 

— academic, polytechnic and university presses: 

• Academy of Athens 

• Polytechnics of Athens 

• University Publications of Crete 

• University of Ioannina 

• Democritus University of Thrace 

— private educational organizations: 

• ΚΟΡΕΛΚΟ 

• ΜΩΡΑΪΤΗ 

— others: 

• Hellenic Parliament 

• military press 

• Orthodox Church 



10 Testing 

The following testing procedure was used for μονο2πολυ development. The 
author worked on SuSE Linux and Windows XP, but any platform (Linux, 
Unix, Mac or Windows) should work as well, as long as the JDK is installed. 

1. Visit any web site with rich Modern Greek content, for example, news sources 
such as http://www.pathfinder.gr. 

2. Open a new document with a word processor supporting spell checking of 
monotonic Greek. 

3. Copy a long excerpt of continuous text from the web site. 

4. Paste the selected and copied text into the word processor window. 

5. Correct any misspelled words, but do not use any style or font effects. 

6. Save the document as a plain ISO 8859-7 encoded text file. 

7. Process the document with μονο2πολυ, as a Java application from a console 
window. 

8. Take the 7-bit TeX result and add it to the LaTeX template file in your 
favourite environment (LyX, Kile, MiKTeX, etc.). 

9. Produce a .ps or a .pdf file and check the final result with GSview or some 
other reader. 

The results improve as the database is enriched. However, some manual 
editing is inevitable, depending on the complexity of the document to be multi- 
accented, because authors may mix Ancient Greek phrases into Modern Greek 
sentences. 



11 Future Developments 

One important improvement would be to relocate some of the intelligence to 
external script files, for defining and modifying the polytonic grammar rule sets. 

Another avenue is to integrate μονο2πολυ with the source and data files of 
the open source Writer2LaTeX project (by Henrik Just, 
http://www.hj-gym.dk/~hj/writer2latex/). That would provide a reverse 
conversion, from UTF-16BE/LE encoded Greek documents into 7-bit (La)TeX. 



12 Related Links 

Here we list some further readings on the complexity of Greek multiple accenting 
and related subjects. First, these articles (mainly written in Greek) on the 
importance of the spiritus lenis and especially of the spiritus asper: 

— http://www.typos.com.cy/nqcontent.cfm?a_id=4681 

— http://www.kairatos.com.gr/polytoniko.htm 

— http://www.krassanakis.gr/tonos.htm 

— http://www.mathisis.com/nqcontent.cfm?a_id=1767 

Further general sources are the following: 

— Ministry of National Education and Religious Affairs - 
http://www.ypepth.gr/en_ec_home.htm 

— Institute for Language and Speech Processing - http://www.ilsp.gr 

— http://www.ekivolos.gr 



Abstract. We are using LaTeX to typeset an old Marathi-English 
dictionary, dated 1857. Marathi is the official language of Maharashtra, a 
western state of India. Marathi (मराठी) is written using the Devanagari 
script. The printed edition of the dictionary contains approximately 
1000 Royal Quarto size (9^ x 12| ) pages with around 60,000 words. 
The roots of the words come from many languages including Sanskrit, 
Arabic and Persian. Therefore the original dictionary contains at least 
three different scripts along with many esoteric punctuation marks and 
symbols that are not used nowadays. 

We have finished typesetting 100 pages of the original dictionary. We 
present our experiences in typesetting this long work involving Devanagari 
and Roman script. For typesetting in Devanagari script we used the 
devnag package. We have not yet added the roots in other scripts but 
that extension can be achieved with the help of ArabTeX. We want to 
publish the dictionary in electronic format, so we generated output in 
PDF format using pdfLaTeX. The bookmarks and cross-references make 
navigation easy. In the future it would be possible to design the old 
punctuation marks and symbols with the help of METAFONT. 



1 Introduction 

Marathi is a language spoken in the Western part of India, and it is the official 
language of Maharashtra state. It is the mother tongue of more than 50 million 
people. It is written in the Devanagari script, which is also used for writing 
Hindi, the national language of India, and Sanskrit. The script is written from 
left to right. A consonant and vowel are combined together to get a syllable; in 
some cases consonants can be combined together to get conjuncts or ligatures. 
While combining the vowel and a consonant one might have to go to the left of 
the current character, which is a big problem for a typesetting program. 

We are typesetting a Marathi-English dictionary compiled by J. T. Molesworth 
and published in 1857. The dictionary is old so there is no problem about 
copyright. This will be the first Marathi-English dictionary in an electronic 
format. 



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 55—58, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 



2 Devanagari Script 

There are 34 consonants, 12 vowels, and 2 vowel-like sounds in Marathi. Table 
1 gives the consonants along with some common English words to illustrate the 
sounds. In some cases, there is no exact equivalent English sound, and we give 
those with standard philological transliteration. The h in this table designates 
aspiration, and a dot under a consonant designates retroflexion. Although Hindi 
and Marathi use the same Devanagari script, the consonant ळ, which is used 
in Marathi, is not used in Hindi. Similarly some characters used in Sanskrit are 
not used in Marathi. All the consonants have one inherent vowel अ (a), and in 
order to write the consonant itself without the vowel, a special “cancellation” 
character (्) called virāma must be used. For example, क is क् + अ, where अ 
is a vowel. 



Table 1. Devanagari consonants. 

क        ख        ग        घ        ङ 
car      kh       go       gh       nasal 

च        छ        ज        झ        ञ 
chair    cch      jail     zebra    nasal 

ट        ठ        ड        ढ        ण 
ṭ        ṭh       ḍ        ḍh       ṇ 

त        थ        द        ध        न 
Tehran   th       dark     dh       new 

प        फ        ब        भ        म 
pair     fail     bat      bh       man 

य        र        ल        व 
yellow   road     love     way 

श        ष        स        ह 
share    ṣ        sun      happy 



Table 2. Devanagari vowels. 

अ        आ        इ        ई        उ        ऊ 
about    car      sit      seat     put      root 

ऋ        ऌ        ए        ऐ        ओ        औ 
under    bottle   say      by       road     load 

अं       अः 




Table 2 lists the vowels and the two vowel-like sounds. The first two rows 
give the vowels and the last row gives the vowel-like sounds, called anuswāra and 
visarga, respectively. In general, the vowels are paired with one member short, 
the other long. For ऋ, make the r into a syllable by itself. 

A vowel is added to a consonant to produce a syllable; for example, all the 
consonants written above already have the vowel अ. Suppose we want to get the 
sound in Sarah, with a long a. We add the second vowel आ to स् to get सा, where 
we can see a bar added behind the consonant स. 

Now to write sit we add the third vowel इ to स्, and here it gets difficult, 
because although the i is pronounced after the s, it is written before the consonant. 
We get सि, where a bar is added before the character स. 
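
The mismatch between spoken order and written order can be made concrete with a short sketch. In Unicode text the vowel sign i (U+093F) is stored after its consonant, while on paper it stands before it. The following illustrative Java code moves the sign in front of a single preceding consonant. It is not how the devnag preprocessor works; the class name is hypothetical, and conjuncts, where the sign has to move in front of a whole consonant cluster, are deliberately ignored.

```java
class MatraOrder {
    static final char VOWEL_SIGN_I = '\u093F';   // Devanagari vowel sign i

    /** Moves each vowel sign i in front of the character it follows (visual order). */
    static String toVisualOrder(String logical) {
        StringBuilder out = new StringBuilder(logical);
        for (int k = 1; k < out.length(); k++) {
            if (out.charAt(k) == VOWEL_SIGN_I) {
                out.setCharAt(k, out.charAt(k - 1));   // consonant moves one place right
                out.setCharAt(k - 1, VOWEL_SIGN_I);    // the sign now comes first
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Logical order: sa (U+0938), vowel sign i (U+093F), tta (U+091F)
        String visual = toVisualOrder("\u0938\u093F\u091F");
        for (char c : visual.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```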

Syllables can even be formed using more than one consonant and a vowel. 
For example, 'jlqdl'Mi , here we add T, ^ and It can also be written as . 

There are many such variations when two or more consonants are combined, and 
some conjunct characters look nothing like their constituent parts. For example, 
र or r is written in four different ways depending on the other consonant in the 
conjunction. 

The first vowel-like sound, anuswāra, is the nasal consonant at the end of 
each of the first five consonant rows in the consonant table. For example, in गंगा 
(Ganges), the “ं” on the first character is the anuswāra but it is pronounced 
as the nasal sound in the row of the next character, which is ग. The sound is 
like ङ्. Visarga is more or less a very brief aspiration following the inherent vowel 
(and for this reason it is usually written h in philological transcription). 

3 Problems 

We tried many approaches before choosing LaTeX. Scanning the pages was out 
of the question as the print quality is very poor. Also many of the consonant 
conjuncts are written in a different way nowadays, so it would be difficult for 
the average modern reader to decipher the old dictionary. There are some Web 
sites that have dictionaries in two scripts using Unicode. But in many cases 
they do not show the correct output, and it is difficult to find suitable viewers. 
We thank the referees for mentioning an XML approach, but we did not try that. 
We also tried Omega, but there was hardly any information available when we 
started our work more than two years ago and also the setup was very difficult. 

The first problem was having two scripts in the text, and typesetting it 
such that both scripts mesh well. Molesworth uses Marathi words to explain 
the concepts so Devanagari script text appears also in the meaning. Also there 
are couplets of a poem in places to explain the usage. Many Marathi words 
have roots in Sanskrit, Hindusthani, Arabic and Persian. Arabic, Persian, and 
the Urdu variant of Hindusthani are written using the Arabic script, which is 
the third script used in the dictionary. In Marathi, a word is spoken (and also 
written) in a slightly different way depending on the region of the publication. 
Therefore in the dictionary, the most used form usually has the meaning listed 
for it, and all other forms have a note pointing to the most used form. This 
requires cross-referencing for faster use. 

The dictionary has a long preface giving the details of how the words were 
chosen, which meanings were added, and so on. It contains different symbols 
and punctuation marks. Also in the meaning of some words, symbols are used 
to show the short form used during that period, which is obsolete now. 

The printed dictionary is heavy, so carrying it everywhere is out of the question. 
We wanted to give the user the possibility to carry the dictionary on a compact 
disc or computer. Therefore the next question was, which is the most 
user-friendly and/or popular output format? 



4 Solution 

In a single word: LaTeX. We mainly used two packages to do the typesetting: 
lexikon for dictionary style, and devnag for Devanagari script. Typesetting in 
Devanagari script is a two-step process. A file, usually with extension .dn, is 
processed with devnag, a program written in the C language, to get a .tex 
file. The preprocessing step is necessary due to the problem of vowel placement, 
complex conjunct characters, and so on, as mentioned in the introduction. The 
style file dev is used to get the Devanagari characters in the output pdf file after 
compiling using pdfLaTeX. 

Once we have a .tex file we can get output in many formats, DVI, PS, pdf, 
etc. We chose pdf as there are free readers for almost all platforms and pdfLaTeX 
makes it easy to go from TeX to pdf. The hyperref package solved the problem 
of cross-referencing and bookmarks. The user can click on a hyperlinked word 
to go to the form of the word that has the complete meaning, and come back to 
the original word with the back button in his favourite reader. In addition to the 
hyperlinks, bookmarks make navigation much easier; for example, bookmarks 
point to the first words starting with aa, ab, ac, etc. An additional nested level 
of bookmarks is chosen if there are many words starting with the character 
combination. For example, if there are many words starting with ac then we 
also have bookmarks for aca, acc and so on. Usually there are fewer than five 
pages between two bookmarks, so finding a word is not time consuming. 
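
The two-level bookmark scheme can be sketched as follows. This is only an illustration of the idea, not the authors' procedure: the class name, the threshold parameter maxGroup and the transliterated sample words are assumptions, and the paper does not say how "many words" was actually decided.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class Bookmarks {
    /** Two-letter bookmarks, plus three-letter ones where a group exceeds maxGroup words. */
    static Set<String> prefixes(List<String> words, int maxGroup) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            if (w.length() >= 2) {
                counts.merge(w.substring(0, 2), 1, Integer::sum);   // group size
            }
        }
        Set<String> marks = new TreeSet<>(counts.keySet());          // first level
        for (String w : words) {
            if (w.length() >= 3 && counts.get(w.substring(0, 2)) > maxGroup) {
                marks.add(w.substring(0, 3));                        // nested level
            }
        }
        return marks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ab", "aca", "acb", "acc", "ad");
        System.out.println(prefixes(words, 2));
    }
}
```

Here only the ac group is larger than the threshold, so it alone receives the nested bookmarks aca, acb and acc.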

The preface contains characters like 3TTT, which is not part of the modern 
Marathi character set, but which was used as a short form a hundred years ago. 
To typeset this character we directly edited the .tex file after preprocessing to 
get the required result. 

We have attached at the end of this article an annotated sample page from 
the typeset dictionary. At the top of the page the first entry is the first word 
on the page, then the copyright information with our name for the dictionary 
(simply translated as “the world of words”), followed by the entry of 
the last word on the page. On the right hand side is the page number. At the 
bottom, the page number is given in Devanagari script. 

5 Future Work 

After completing the typesetting of the whole dictionary we will add the roots of
the words in Hindusthani, Arabic and Persian. Currently we denote this using
[H], [A] or [P], respectively. We have tried typesetting in three scripts on some
small examples and did not find any conflicts between ArabTeX and devnag.
We have not yet created new symbols but it is possible with the help of the
pstricks package or METAFONT.

Abstract. Several files with Greek hyphenation patterns for TeX can
be found on CTAN. However, most of these patterns are for use with
Modern Greek texts only. Some of these patterns contain mistakes or
are incomplete. Other patterns are suitable only for some now-outdated
“Greek TeX” packages. In 2000, after having examined the patterns that
existed already, the author made new sets of hyphenation patterns for
typesetting Ancient and Modern Greek texts with the greek option of
the babel package or with Dryllerakis’ GreeKTeX package. Lately, these
patterns have found their way even into the ibycus package, which can be
used with the Thesaurus Linguae Graecae, and into Ω with the antomega
package.

The new hyphenation patterns, while not exhaustive, do respect the 
grammatical and phonetic rules of three distinct Greek writing systems. 
In general, all Greek words are hyphenated after a vowel and before a 
consonant. However, for typesetting Ancient Greek texts, the hyphen- 
ation patterns follow the rules established in 1939 by the Academy of 
Athens, which allow for breaking up compound words between the last 
consonant of the first constituent word and the first letter of the sec- 
ond constituent word, provided that the first constituent word has not 
been changed by elision. For typesetting polytonic (multi-accent) Modern
Greek texts, the hyphenation rules distinguish between the nasal and
the non-nasal double consonants μπ, ντ, and γκ. In accordance with the
latest Greek grammar rules, in monotonic (uni-accent) Modern Greek
texts, these double consonants are not split.



1 Introduction 

Before 2000, one could find on CTAN four different files with hyphenation pat- 
terns for Modern Greek only, namely 

— rgrhyph.tex by Yannis Haralambous [1], 

— grkhyphen.tex by Kostis Dryllerakis [2], 

— gehyphen.tex by Yiannis Moschovakis [3], and 

— grhyph.tex by Claudio Beccari [4]. 



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 59-67, 2004.
© Springer-Verlag Berlin Heidelberg 2004

The first two hyphenation-pattern files [1,2] are almost identical. The only
difference is that the patterns by Dryllerakis contain an \endinput command several
lines before the end-of-file. (Probably, Dryllerakis cut down Haralambous’
patterns to reduce memory usage, at a time when memory space was still rather
limited.) The patterns by Moschovakis [3] are not only limited to Modern Greek,
but they have been “frozen” based on an obsolete mixed US-Greek codepage for
DOS and an equally obsolete LaTeX 2.09. The end result is that some words
containing vowels with combined diacritical marks (e.g., εἶδος, ᾄδω, etc.) are not
hyphenated at all.

Haralambous’ patterns [1] do not provide for the correct hyphenation of
combinations of three or more consonants. In addition, they do not allow for
the hyphenation of the nasal consonant combinations μπ (mb), ντ (nd) and γκ
(ng), which must be split in polytonic Modern Greek. Haralambous’ patterns
erroneously split the combination xp and prohibit the hyphenation of all final 
two-letter combinations for no apparent reason. 

Beccari’s patterns [4], which are commonly used with the greek option of babel,
contain a number of mistakes and are also incomplete. For example, the word
πυκνότητα is hyphenated as πυκ-νό-τη-τα. According to some rules outlined further
in this text, that word should have been hyphenated as πυ-κνό-τη-τα. Similar
bad hyphenations include ἰσ-θμός (it should be ἰ-σθμός), Ἀλκ-μή-νη (it should
be Ἀλ-κμή-νη), etc. Beccari’s patterns also allow for separation of the consonant
combinations δμ, δν and τλ. These combinations should not be split, because one
can find some Ancient Greek words that start with such combinations (δμώς,
δνοφερός, τλημοσύνη).

In 2000, while typesetting a large volume in polytonic Modern Greek, the 
author of the present article noticed the mishaps in Beccari’s hyphenation pat- 
terns and the inadequacy of all other Greek hyphenation patterns. He noticed 
also that hyphenation patterns for Ancient Greek, although they had been dis- 
cussed by Haralambous back in 1992 [5], were not available at all in the public 
domain. That was the incentive for the author to revise the existing hyphen- 
ation patterns for Modern Greek and to provide in the public domain a set of 
hyphenation patterns for Ancient Greek. 

The author has already presented these patterns in the newsletter of the
Greek TeX Friends [6,7], but this communication is the first (and long overdue)
presentation of the patterns to the global TeX community. The patterns were
created for the 1988 de facto Levy Greek encoding [8], which later became the
Local Greek (LGR) encoding.

2 Creation of Patterns 

One way to produce hyphenation patterns is by using PATGEN [9]. PATGEN scans
a given database of hyphenated words and prepares a set of hyphenation patterns
based on observations the programme has made. Another way of using
PATGEN is modular [10]: first one creates a limited set of hyphenated words,
then runs PATGEN on these words, checks the produced hyphenation patterns
and expands the list of hyphenated words with those words that were badly
hyphenated. The whole cycle (create word list, run PATGEN, check bad hyphenations,
expand word list) is repeated until an acceptable set of hyphenation patterns
is produced. To the author’s knowledge, an electronic dictionary with hyphenated
words does not exist for Greek. Given the excessive morphology of the
Greek words, even the modular use of PATGEN would be a daunting task. A less
time-consuming effort is the translation of the simple grammatical rules for the
hyphenation of Greek into patterns for the TeX machine, as has already been
done [1,3,4]. This is the solution chosen also by the author of the present article.

Each language has its rules and exceptions that must be duly respected. It is 
not rare for one language to have different hyphenation rules for different dialects, 
or to have different hyphenation rules for texts written in different eras. The 
best-known example is English, where some words are hyphenated differently 
depending on the continent (e.g., pre-face in British English and pref-ace in 
American English). 

In the case of Greek, one has to distinguish, roughly, between three “dialects”
that demand separate sets of hyphenation patterns:

1. Ancient Greek and old-style literate Modern Greek (katharevousa), 

2. polytonic Modern Greek, and 

3. monotonic Modern Greek. 

Ancient Greek is taken here to cover essentially every text written in Greek
from Homeric times (8th century B.C.) to about the end of the Byzantine Empire
(15th century A.D.). Katharevousa (literally, the purifying) is a formal written
language (almost never spoken) conceived by Greek scholars in the period of the
Enlightenment as a way to purify Modern Greek from foreign influences. It was
used in Greek literature of the 19th and early 20th century, and by the Greek
state from its creation in 1827 until 1976. It is still used by the Greek Orthodox
Church.

Polytonic and monotonic Modern Greek are essentially the same language.
The only difference is that polytonic (literally, multi-accent) Modern Greek uses
all accents, breathings and diacritics of Ancient Greek and katharevousa, while
monotonic (literally, uni-accent) Modern Greek, which was adopted officially in
Greece in 1982, has just one accent mark (much to the dismay of some classicists).

The hyphenation rules for Ancient Greek and katharevousa have special provisos
for compound words [11]. The hyphenation rules for polytonic Modern
Greek make a distinction between nasal μπ, ντ and γκ (pronounced as mb, nd
and ng respectively) and non-nasal μπ, ντ and γκ (pronounced as b, d and g) [12].
The hyphenation rules for monotonic Modern Greek do not distinguish between
nasal and non-nasal μπ, ντ and γκ, nor do they make any special demand for
compound words [13].

2.1 Patterns for Modern Greek 

Monotonic Texts. The grammatical rules for the hyphenation of monotonic
Modern Greek [13] and the corresponding hyphenation patterns are:

1. One consonant between two vowels always remains together with the second
vowel. This rule can be seen slightly differently: a word is hyphenated after each
vowel, for example, τη-λε-ό-ρα-ση. With the Levy character encoding [8], the
corresponding hyphenation patterns are: a1 e1 h1 i1 o1 u1 w1.

2. Double vowels (diphthongs in Ancient Greek) that are pronounced as one
are not hyphenated. Hence the double vowels αι, αί, αυ, etc. should not be
split apart. The corresponding hyphenation patterns are: a2i a2'i a2u ...
u2i u2'i. However, when the first vowel is accented, the two vowels are to
be pronounced separately and they can be hyphenated. Hence, we include
some exceptions: 'a3u 'e3u 'o3u 'u3i.

3. Semi-vowels are not hyphenated. Vowels, simple and double, that are usually
pronounced as i are sometimes semi-vowels, i.e., they are not pronounced
totally separately from the preceding or following vowel. Hence some vowel
combinations involving semi-vowel sounds (j) should not be split apart. The
most common semi-vowel combinations are: aj (νε-ράι-δα), ej (ζεϊ-μπέ-κης)
and oj (κο-ρόι-δο) when they are accented on the first vowel or when they
are not accented at all, and the combinations ja (δια-βά-ζω), je (ε-λιές) and
jo (Μα-ριώ) when they are accented on the second vowel or when they are
not accented at all. The resulting hyphenation patterns are: a2h a2"i ...
i2a i2'a ... u2w u2'w. Some notable exceptions are: 'a3h ... 'u3w.

It is worth noting that there is an inherent difficulty in distinguishing between
vowels and semi-vowels. Sometimes, two words are written the same,
but they are pronounced with or without a semi-vowel, thus completely
changing their meaning, e.g., δό-λια (the adjective devious in feminine singular)
and δό-λι-α (the adverb deviously). Distinguishing between a semi-vowel
and a true vowel is very difficult and requires textual analysis [14]. For the
purpose of TeX, all such suspicious semi-vowel combinations are treated as
semi-vowels. The end result is that the word a'K6r\ypQ will be hyphenated
as But it is better to see some words hyphenated with one syllable fewer
than to see extra syllables in other words, e.g., po-ri-ffa na-va-yi-d!
(Apparently, Liang took the same approach, disallowing some valid hyphenations
for the sake of forbidding definitely invalid ones [15].)

4. Single or double consonants at the end or the beginning of a word do not
constitute separate syllables. The corresponding patterns are 4b. 4g. ...
4y. and .b4 .g4 ... .y4. To these patterns, one must add some other ones for
the case of elision: 4b'' 4g'' ... 4y''.

5. Double consonants are hyphenated. The patterns for this rule are: 4b1b 4g1g
... 4q1q 4y1y.

6. Consonant combinations that cannot be found at the beginning of Greek words
must be split after the first consonant. The patterns are: 4b1z 4b1j ... 4y1f
4y1q. No distinction is made between nasal and non-nasal μπ (mb/b), ντ (nd/d)
and γκ (ng/g); these consonant combinations are not to be split. However,
some other patterns are inserted to deal with some thorny combinations of
three or more consonants:

   4r5g2m    ἔρ-γμα (Anc. Gr.)
   tz2m      μάνα-τζμεντ
   4r5j2m    πορ-θμός
   4m5y2t    λάμ-ψτε
   4g1kt     ελεγ-κτής
   4n1tz     νεραν-τζιά
   4n1ts     βιολον-τσέλο

More patterns could have been inserted here to deal with non-Greek proper
names with sequences of three or more consonants transliterated into Greek.
For example, the pattern 4r5l2s could have been added to hyphenate the
transliterated name Carlson as Κάρ-λσον and not as Κάρλ-σον (the latter
is not allowed according to Greek grammar rules). However, the number of
such words is infinite and the effort would most likely be worthless.

7. Two or more consonants at the end of a word do not constitute separate
syllables. Such endings are mostly found in Ancient Greek words, or words of
non-Greek origin which have become part of the Modern Greek vocabulary:
4kl. (πι-νάκλ) ... 4nc. (ἕλ-μινς, Anc. Gr.) ... Such words can be found
easily in reverse dictionaries of Modern Greek [16].

8. Combinations of double consonants are separated. These are some rare
combinations of non-nasal μπ with ντ and/or γκ in words of non-Greek origin
which are now part of the Modern Greek vocabulary, e.g., 4mp1nt
(ρομπ-ντεσάμπρ = robe-de-chambre).
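
To make the interplay of these rules concrete, the following Python sketch applies Liang-style patterns to words written in the Levy transliteration. It is only an illustration under stated assumptions: the names parse_pattern, hyphenate and PATTERNS are invented here, the pattern list is a hand-picked subset in the shape of rules 1, 2 and 5 rather than the contents of the actual pattern files, and the limits that TeX imposes through \lefthyphenmin and \righthyphenmin are ignored.

```python
def parse_pattern(pattern):
    """Split a pattern such as '4l1l' into its letters and inter-letter digits.

    '4l1l' -> ('ll', [4, 1, 0]): 4 before the first l, 1 between the two, 0 after.
    """
    letters, digits = [], [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def hyphenate(word, patterns):
    """Insert '-' wherever the highest competing digit is odd."""
    wrapped = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(wrapped) + 1)   # scores[i] sits just before wrapped[i]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = wrapped.find(letters)
        while start != -1:
            for offset, digit in enumerate(digits):
                scores[start + offset] = max(scores[start + offset], digit)
            start = wrapped.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):       # only positions inside the word
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Illustrative subset only, not the contents of the real pattern files.
PATTERNS = ["a1", "e1", "h1", "i1", "o1", "u1", "w1",   # rule 1
            "a2i", "a2u", "'a3u",                       # rule 2 and one exception
            "4l1l"]                                     # rule 5

for word in ["thle'orash", "paid'i", "'aula", "'allo"]:
    print(word, "->", hyphenate(word, PATTERNS))
```

With this subset the four transliterated words come out as th-le-'o-ra-sh, pai-d'i, 'a-u-la and 'al-lo. The even digit in a2i suppresses the break that a1 would otherwise allow, while the odd 3 in 'a3u outranks the 2 in a2u and restores the break after an accented first vowel.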



Polytonic Texts. The hyphenation rules that apply to monotonic
Modern Greek texts apply also to polytonic Modern Greek texts. Of course, the
patterns for polytonic Modern Greek had to be expanded to include all possible
combinations of vowel and diacritic (breathing, accent and/or iota subscript).

As mentioned above, polytonic Modern Greek has another notable difference
in hyphenation: the nasal μπ, ντ and γκ, which are pronounced as mb, nd and
ng respectively, are to be separated. On the contrary, the non-nasal μπ, ντ and
γκ, which are pronounced as b, d and g, must not be separated. In general, μπ,
ντ and γκ are nasal, thus the patterns: 4m1p, 4n1t, and 4g1k. These consonant
combinations are non-nasal when they follow another consonant (ἄλ-μπου-ρο,
σεβ-ντάς, ἀρ-γκό, etc.) or in words of non-Greek origin (Ἰ-μπραήμ, μπι-ντές, etc.).

For the creation of hyphenation patterns, the non-nasals μπ, ντ and γκ can
be treated in the same way Haralambous treated Ancient Greek compound
words [5]. Hence, with the help of Andriotis’ etymological dictionary [17], a
list of exceptions was built such as:

   .giou5g2k    Γιου-γκοσλάβος
   5g2krant.    Βόλγκο-γκραντ
   .qa5n2to     χα-ντούμης
   .qa5n2tr     χα-ντρῶν
   .q'a5n2tr    χά-ντρα

The list of all these exceptions is quite lengthy and covers five printed pages of
Eutypon [6].
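
How such an exception overrides the general nasal pattern can be traced with a small Python sketch. The function names and the two short pattern lists are assumptions made for illustration; only the patterns 4n1t and .q'a5n2tr themselves are taken from the text. At every position between two letters the highest digit wins, and a break is allowed only where the winner is odd.

```python
def pattern_levels(pattern):
    """Return (letters, levels) for a pattern, e.g. '4n1t' -> ('nt', [4, 1, 0])."""
    letters, levels = "", [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def break_levels(word, patterns):
    """Highest level found at every inter-letter position of '.word.'."""
    text = "." + word + "."
    best = [0] * (len(text) + 1)        # best[i] sits just before text[i]
    for pattern in patterns:
        letters, levels = pattern_levels(pattern)
        for start in range(len(text) - len(letters) + 1):
            if text.startswith(letters, start):
                for k, level in enumerate(levels):
                    best[start + k] = max(best[start + k], level)
    return best


def show(word, patterns):
    """Render a non-empty word with '-' at every odd-valued position."""
    best = break_levels(word, patterns)
    out = word[0]
    for i in range(1, len(word)):
        if best[i + 1] % 2 == 1:
            out += "-"
        out += word[i]
    return out


GENERAL = ["a1", "e1", "4n1t"]          # vowels plus the nasal pattern for nt
EXCEPTIONS = [".q'a5n2tr"]              # non-nasal nt, as in the list above

print(show("p'ente", GENERAL + EXCEPTIONS))    # nasal nt is split
print(show("q'antra", GENERAL))                # general rule alone
print(show("q'antra", GENERAL + EXCEPTIONS))   # exception keeps nt together
```

The three calls give p'en-te, q'an-tra and q'a-ntra. In the last case the 5 of the exception beats the 4 that 4n1t places before the n, and the 2 beats the 1 between n and t, so the break moves in front of the non-nasal cluster.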

2.2 Patterns for Ancient Greek 

The grammatical rules for hyphenation of Ancient Greek are mostly the same as
those for polytonic Modern Greek. Apparently, the Ancient Greeks hyphenated
following the simple rule that a single consonant between two vowels in one
word belongs with the second vowel: σο-φί-ζω, κα-θά-περ. The Ancient Greeks
also considered non-accented words as being part of the following word [18]. For
example, the Ancients would hyphenate ἐκ τούτου as ἐ-κτού-του. Nonetheless,
rules introduced by later scholars do not allow for such extravagant hyphenations.

A very tricky rule introduced by modern scholars states that “[Ancient Greek]
compound words divide at the point of union” [18]. This rule has been extended
to katharevousa and some typographers are still using it for polytonic Modern
Greek (most likely mistakenly). That rule also appears in two variations. In
one variation, which has been adopted by The Chicago Manual of Style [19],
compound words are divided into their original parts irrespective of whether
those original parts have been modified or not. Therefore, one should hyphenate
στρατ-ηγός (στρατὸν + ἄγω), Διόσ-κουρος (Διὸς + κοῦρος), etc. This is the rule
followed by Haralambous for the creation of Ancient Greek hyphenation patterns
for the commercial package ScholarTeX [5]. In another variation, adopted by
some 19th-century scholars [20] and the Academy of Athens [21], compound
words are divided into their original constituent words only when the first word
has not lost its last vowels by elision. According to that rule variation, the word
στρατηγός should be hyphenated as στρα-τηγός, because the first word (στρατὸν)
has lost its final ον.

For the creation of hyphenation patterns for Ancient Greek, the author chose 
to follow the rule adopted by the Academy of Athens, because this rule has also 
been adopted in the manuals used in the Greek high schools and lycees [11]. 
Thus, with the help of two widely-used dictionaries [22,23], a list of exceptions 
for compound words was incorporated into the list of patterns for Ancient Greek:

   >adi'e2x1         ἀδιέξ-οδος
   >adie2x1          ἀδιεξ-όδου
   >adu2s1'w         ἀδυσ-ώπητος
   >adu2s1w          ἀδυσ-ωπήτου
   i2s1qili'akic.    δισ-χιλιάκις, etc.
   i2s1muri'akic.    δισ-μυριάκις, etc.

This list is quite extensive; it includes 1555 patterns and covers twenty-eight 
printed pages [24]. 

It is worth mentioning here that special care has been taken not to confuse
Ancient and Modern Greek exceptions for the division of consonants. For example,
there are no Ancient Greek words that start with the Modern Greek
double consonants μπ, ντ, γκ, and τσ. Therefore, all these combinations are
divided in Ancient Greek texts, with no exception. Also, combinations of stopped
consonants (π, β, φ / τ, δ, θ / κ, γ, χ) with the nasals μ or ν are not divided [20].

3 Putting the Patterns to Work 

The patterns have been archived in three files, which have already found their 
way onto CTAN [24]: 

— GRMhyph?.tex for monotonic Modern Greek, 

— GRPhyph?.tex for polytonic Modern Greek, and 

— GRAhyph?.tex for Ancient Greek. 

(The ? is a number indicating the current version.) 

The first two sets of patterns, for Modern Greek, were tested on a sample text
created by the author, after building a new bplain format [6]. (Incidentally, bplain is
just an extension of Knuth’s plain format for the use of several languages with
babel.) The result showed considerable improvement in comparison to the
hyphenation results obtained with earlier sets of patterns (Table 1). With another
bplain format, the hyphenation patterns for Ancient Greek were tested on five classic
texts in their original:

— Herodotus, The Histories A, I-III; 

— Xenophon, Anabasis A, 1. IV. 11-13; 

— Plutarch, Lives Themistocles, II. 1-5; 

— Strabo, Geography, 7. 1.1-5; and 

— Lysias, Defence against a Charge for Taking Bribes. 

Surprisingly, TeX correctly hyphenated all Ancient Greek words found in these
texts, which cover about seven printed pages [7,24].
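
The two figures reported in Table 1, mistakes and misses, could be tallied along the following lines. This is a sketch under an explicit assumption: the text does not say what the percentages are relative to, so both are expressed here as a share of the valid break points in a hand-hyphenated reference list. The function names and the three-word sample are hypothetical and are not the author's test data.

```python
def break_points(hyphenated):
    """Return the set of character offsets at which a hyphenated word is split."""
    points, offset = set(), 0
    for piece in hyphenated.split("-")[:-1]:
        offset += len(piece)
        points.add(offset)
    return points


def score(reference, produced):
    """Compare two parallel lists of hyphenated words.

    Returns (mistake_pct, miss_pct), both relative to the number of valid
    break points in the reference list (an assumption, see above).
    """
    valid = wrong = missed = 0
    for ref, out in zip(reference, produced):
        if ref.replace("-", "") != out.replace("-", ""):
            raise ValueError("word lists are not parallel: %r / %r" % (ref, out))
        ref_pts, out_pts = break_points(ref), break_points(out)
        valid += len(ref_pts)
        wrong += len(out_pts - ref_pts)    # breaks that should not be there
        missed += len(ref_pts - out_pts)   # valid breaks that were not found
    if valid == 0:
        raise ValueError("reference list contains no break points")
    return 100.0 * wrong / valid, 100.0 * missed / valid


# Hypothetical sample in the Levy transliteration.
reference = ["th-le-'o-ra-sh", "pai-d'i", "'al-lo"]
produced = ["th-le-'o-rash", "pa-i-d'i", "'al-lo"]
mistakes, misses = score(reference, produced)
print("mistakes: %.1f%%, misses: %.1f%%" % (mistakes, misses))
```

In the sample, one break is missing from the first word and one spurious break appears in the second, against six valid break points in total, so both figures come out at one sixth.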

The author, however, does not believe that his patterns are error-free. The
Ancient Greek word προσκοπή has two different etymologies and meanings:
“looking out for” hyphenated as προ-σκοπή (πρὸ + σκοπέω), or “an offence”
hyphenated as προσ-κοπή (πρὸς + κόπος). Unfortunately, TeX does not do textual
analysis and will not understand the difference. Syllables with vowel synizesis
may be erroneously split apart, e.g., χρυ-σέ-ῳ instead of χρυ-σέῳ. Again, TeX
does not do textual analysis and it is impossible for a typesetting system to
capture such small details. Finally, the use of the same patterns for typesetting
a mixed Ancient and Modern Greek text will bring a few surprises. For the
purpose of TeX, Ancient and Modern Greek are better treated as two different
\languages.



4 Creation of Patterns for ibycus and Ω 

The patterns created by the author have already been picked up by other people
who are working on other packages or systems that use different font encodings.
Using a Perl script, Apostolos Syropoulos adapted the hyphenation patterns for
monotonic Modern Greek and Ancient Greek for use with Ω [25]. Using another
Perl script, Peter Heslin [26] adapted the hyphenation patterns for Ancient
Greek for the ibycus package, which can be used for typesetting texts obtained
from the Thesaurus Linguae Graecae.

66 Dimitrios Filippou

Table 1. Results from hyphenation tests with three different sets (files) of hyphenation
patterns available in the public domain. Mistakes represent erroneous hyphenations.
Misses represent missed hyphenation points.

   Patterns                    Mistakes (%)    Misses (%)
   rgrhyph.tex [1]                  25             13
   grhyph.tex [4]                    3             16
   GRPhyph.tex (this work)           -              3

5 Conclusions 

The hyphenation patterns created by the author for Ancient and Modern Greek
are indeed superior to those previously found on CTAN. Nonetheless, the patterns
are presently under revision to eliminate a few minor mistakes. The author
anticipates that the improved patterns will be released on CTAN very soon,
probably before the TUG 2004 conference. Hopefully, these patterns will shortly
afterwards migrate into ibycus and Ω, and they will become the default Greek
hyphenation patterns in whatever system/package becomes the successor of TeX.

Abstract. The Deseret Alphabet was an orthographical reform for English,
promoted by the Church of Jesus Christ of Latter-day Saints (the
Mormons) between about 1854 and 1875. An offshoot of the Pitman
phonotypy reforms, the Deseret Alphabet is remembered mainly for
its use of non-Roman glyphs. Though ultimately rejected, the Deseret
Alphabet was used in four printed books, numerous newspaper articles,
several unprinted book manuscripts, journals, meeting minutes, letters
and even a gold coin, a tombstone and an early English-to-Hopi vocabulary.
This paper reviews the history of the Deseret Alphabet, its Unicode
implementation, fonts both metal and digital, and projects involving the
typesetting of Deseret Alphabet texts.



1 Introduction 

The Deseret Alphabet was an orthographical reform for English, promoted by 
the Church of Jesus Christ of Latter-day Saints (the Mormons) between about 
1854 and 1875. While the Deseret Alphabet is usually remembered today as an 
oddity, a strange non-Roman alphabet that seemed doomed to failure, it was in 
fact used on and off for 20 years, leaving four printed books (including The Book 
of Mormon), numerous newspaper articles, several unprinted book manuscripts 
(including the entire Bible), journals, meeting minutes, letters and even a gold 
coin and a tombstone. There is also growing evidence that the Deseret Alphabet 
was experimentally used by some Mormon missionaries to transcribe words in 
Spanish, Shoshone, Hopi and other languages. 

A number of historians [19,11,20,21,4,1,22,6] have analyzed the Deseret 
Alphabet, which was justly criticized by typographers [21, 31], but what is often 
overlooked is the corpus of phonemically written documents, which are poten- 
tially interesting to both historians and linguists. Because few people, then or 
now, can be persuaded to learn the Alphabet, the majority of the documents 
have lain unread for 140 years. For example, in December of 2002, an “Indian
Vocabulary” of almost 500 entries, written completely in the Deseret Alphabet,
was finally identified as being English-to-Hopi, being perhaps the oldest written
record of the Hopi language.

Fig. 1. On 24 March 1854 the newly adopted Deseret Alphabet was first printed,
probably using wooden type, and presented to the Board of Regents of the Deseret
University. Although this rare flier is undated, it matches the 38-letter Alphabet as
copied into the journal of Regent Hosea Stout on that date [30]. Utah State Historical
Society.

This paper will proceed with a short history of the Deseret Alphabet, putting
it in the context of the Pitman phonotypy movement that inspired it from
beginning to end¹; special emphasis will be placed on the variants of the Alphabet
used over the years, and on the cutting and casting of historical fonts.
Then I will review some modern digital fonts and the implementation of the
Deseret Alphabet in Unicode, showing how some honest mistakes were made
and how the results are still awkward for encoding and typesetting some of
the most interesting historical documents. Finally, I will show how I have used
a combination of XML, LaTeX, the TIPA package and my own METAFONT-defined
[16,10] desalph font to typeset a critical edition of the English-to-Hopi
vocabulary, and related documents, from 1859-60.

2 The Pitman Reform Context 

2.1 The Pitman Reform Movements 

To begin, it is impossible to understand the Deseret Alphabet without knowing
a bit about two nineteenth-century orthographic reformers, Isaac Pitman (1813-
1897) and his younger brother Benn (1822-1910). The Mormon experiments in
orthographical reform, too often treated as isolated aberrations, were in fact
influenced from beginning to end by the Pitman movements, at a time when
many spelling reforms were being promoted.

¹ Parts of this paper were first presented at the 22nd International Unicode Conference
in San Jose, California, 11-13 September 2002 [6].

Fig. 2. Early Pitman phonography.



Pitman Shorthand or Phonography. There have been hundreds of systems
of stenography, commonly called shorthand, used for writing English; but Isaac
Pitman’s system, first published in his 1837 Stenographic Sound-hand and
called “phonography”², was soon a huge success, spreading through the
English-speaking world and eventually being adapted to some fifteen other languages.
Modern versions of Pitman shorthand are still used in Britain, Canada, and in
most of the cricket-playing countries; in the USA it was taught at least into the
1930s but was eventually overtaken by the Gregg system.

² In 1839 he wrote Phonography, or Writing by Sound, being also a New and Improved
System of Short Hand.

The main goal of any shorthand system is to allow a trained practitioner,
called a “reporter” in the Pitman tradition, to record speech accurately at
speed, including trial proceedings³, parliamentary debates, political speeches,
sermons, etc. Pitman’s phonography, as the name implies, differs from most
earlier systems in representing the distinctive sounds of English (what modern
linguists call phonemes) rather than orthographical combinations. Simplicity
and economy at writing time are crucial: consonants are reduced to straight
lines and simple curves (see Figure 2). The “outline” of a word, typically just a
string of consonants, is written as a single connected stroke, without lifting the
pen. Voiced consonants are written as thick lines, their unvoiced counterparts
as thin lines, which requires that a Pitman reporter use a nib pen or soft pencil.
Vowels are written optionally as diacritical marks above and below the consonant
strokes; one is struck by the similarities to Arabic orthography. In advanced
styles, vowels are left out whenever possible, and special abbreviation signs are
used for whole syllables, common words, and even whole phrases.



Pitman Phonotypy. Pitman became justifiably famous for his phonography.
With help from several family members, he soon presided over a lecturing and
publishing industry with a phenomenal output, including textbooks, dictionaries,
correspondence courses, journals, and even books published in shorthand, including
selections from Dickens, the tales of Sherlock Holmes, Gulliver’s Travels,
Paradise Lost, and the entire Bible. But while phonography was clearly useful,
and was both a social and financial success, Pitman’s biographers [25, 24, 2] make
it clear that his real mission in life was not phonography but phonotypy⁴, his
philosophy and movement for reforming English orthography, the everyday script
used in books, magazines, newspapers, personal correspondence, etc.

The first Pitman phonotypy alphabet for which type was cast was Alphabet
No. 11, demonstrated proudly in The Phonotypic Journal of January 1844 (see
Figure 3). Note that this 1844 alphabet is bicameral, sometimes characterized as
an alphabet of capitals; that is, the uppercase and lowercase letters differ only
in size. The letters are stylized, still mostly recognizable as Roman, but with
numerous invented, borrowed or modified letters for pure vowels, diphthongs,
and the consonants /θ/, /ð/, /ʃ/, /ʒ/, /ʧ/, /ʤ/ and /ŋ/⁵.

³ In modern parlance we still have the term court reporter.

⁴ According to one of Pitman’s own early scripts, which indicates stress, he pronounced
the word /fo'notipi/.

⁵ To provide a faithful representation of original Pitman and Deseret Alphabet texts,
I adopt a broad phonemic transcription that uses, as far as possible, a single
International Phonetic Alphabet (IPA) letter for each English phoneme [12]. Thus
the affricates 𐐽 and 𐐾 are transliterated as the rarely used IPA /ʧ/ and /ʤ/ letters,
respectively, rather than the sequences /tʃ/ and /dʒ/ or even the tied forms /t͡ʃ/
and /d͡ʒ/. The diphthongs are shown in IPA as a combination of a nucleus and a
superscript glide. The Deseret Alphabet, and the Pitman-Ellis 1847 alphabet which
was its phonemic model, treat the /ʲu/ vowel in words like mule as a single diphthong
phoneme; see Ladefoged [17] for a recent discussion and defense of this practice.
Although in most English dialects the vowels in mate and moat are diphthongized,
the Deseret Alphabet follows Pitman in treating them as the simple “long vowels”
/e/ and /o/.




72 



Kenneth R. Beesley 



Fig. 3. In January 1844, Isaac Pitman proudly printed the first examples of his
phonotypy. This Alphabet No. 11, and the five experimental variants that followed
it, were bicameral, with uppercase and lowercase characters distinguished only by size.



The goals of general spelling reform, to create a new “book orthography”, 
are quite different from those of shorthand. While shorthand is intended for 
use by highly trained scribes, a book orthography is for all of us and should 
be easily learned and used. Where shorthand requires simplicity, abbreviation 
and swiftness of writing, varying with the reporter’s skill, a book orthography 
should aim for orthographical consistency, phonological completeness and ease of 
reading. Finally, a book orthography must lend itself to esthetic typography and
easy typesetting; Pitman’s phonographic books, in contrast, had to be engraved
and printed via the lithographic process⁶.

Pitman saw his popular phonography chiefly as the path leading to phono- 
typy, which was a much harder sell. His articles in the phonographic (shorthand) 
journals frequently pushed the spelling reform, and when invited to lecture 
on phonography, he reportedly managed to spend half the time talking about 
phonotypy. Throughout the rest of his life, Pitman proposed a long succession 
of alphabetic experiments, all of them Romanic, trying in vain to find a winning 
formula. 



⁶ Starting in 1873, Pitman succeeded in printing phonography with movable type, but
many custom outlines had to be engraved as the work progressed.

Pitman’s phonotypic publications include not only his phonotypic journals
but dozens of books, including again the entire Bible (1850). But in the end,
phonotypy never caught on, and the various phonotypic projects, including the
constant cutting and casting of new type, were “from first to last a serious
financial drain” [2]. In 1894, a few years before his death, Pitman was knighted
by Queen Victoria for his life’s work in phonography, with no mention made of
his beloved phonotypy.

Today Pitman phonotypy is almost completely forgotten, and it has not yet 
found a champion to sponsor its inclusion in Unicode. But Pitman was far from 
alone - by the 1880s, there were an estimated 50 different spelling reforms under 
consideration by the English Spelling Reform Association. This was the general 
nineteenth-century context in which the Deseret Alphabet was born; lots of 
people were trying to reform English orthography. 

2.2 The Mormons Discover the Pitman Movement 

The Church of Jesus Christ of Latter-day Saints was founded in 1830 in upstate 
New York by Joseph Smith, a farm boy who claimed to have received a vision 
of God the Father and Jesus Christ, who commanded him to restore the true 
Church of Christ. He also claimed that he received from an angel a book, 
engraved on golden plates, which he translated as The Book of Mormon. His 
followers, who revered him as a prophet, grew rapidly in number, and soon, 
following the western movement and spurred by religious persecution, they mi- 
grated from New York, to Ohio, to Missouri and then to Illinois, where in 1839 
they founded the city of Nauvoo on the Mississippi River. 

Missionary work had started immediately, both at home and abroad, and 
in 1837, the same year that Pitman published his Stenographic Sound-hand, a 
certain George D. Watt was baptized as the first Mormon convert in England. 
Despite an unpromising childhood, which included time in a workhouse, young 
George had learned to read and write; and between the time of his baptism and 
his emigration to Nauvoo in 1842, he had also learned Pitman phonography. 
The arrival of Watt in Nauvoo revolutionized the reporting of Mormon meeting 
minutes, speeches and sermons. Other converts flowed into Nauvoo, so that by 
1846 it had become, by some reports, the largest city in Illinois, with some 20,000 
inhabitants. 

But violence broke out between the Mormons and their “gentile” neighbors, 
and in 1844 Joseph Smith was assassinated by a mob. In 1845, even during 
the ensuing confusion and power struggles, Watt gave phonography classes; one 
notable student was Mormon Apostle Brigham Young. Watt was also President 
of the Phonographic Club of Nauvoo [1]. In addition to phonography, Watt was 
almost certainly aware of the new phonotypy being proposed by Pitman, and it 
is likely that he planted the idea of spelling reform in Brigham Young’s mind at 
this time. 

In 1846, Watt was sent on a mission back to England. The majority of the 
Church regrouped behind Brigham Young, abandoned their city to the mobs, 
and crossed the Mississippi River to spend the bleak winter of 1846-47 at Winter 
Quarters, near modern Florence, Nebraska. From here Brigham Young wrote to 
Watt in April 1847⁷: 

It is the wish of the council, that you procure 200 lbs of phonotype, or 
thereabouts, as you may find necessary, to print a small book for the 
benefit of the Saints and cause same to be forwarded to Winter Quarters 
before navigation closes, by some trusty brother on his return, so that 
we have the type to use next winter. 

The “phonotype” referred to is the actual lead type used for Pitman phonotypy. 
The Saints, meaning the members of the Church, were still in desperate times - 
600 would die from exposure and disease at Winter Quarters - and while there 
is no record that this type was ever delivered, it shows that the Mormons’ first 
extant plans for spelling reform involved nothing more exotic than an off-the- 
shelf Pitman phonotypy alphabet. 

It is not known exactly which version of Pitman phonotypy Young had in 
mind; Pitman’s alphabets went through no fewer than 15 variations between Jan- 
uary 1844 and January 1847, and the isolated Mormons were likely out of date. 
In any case, Pitman’s alphabets had by this time become more conventionally 
Roman. Alphabet No. 15 (see Figure 4), presented in The Phonotypic Journal 
of October 1844⁸, marked Pitman’s abandonment of the unicameral “capital” 
alphabets, and his adoption of alphabets that distinguished uppercase from 
lowercase glyphs, which he called “lowercase” or “small letter” alphabets. 

The Mormons started leaving Winter Quarters as soon as the trails were 
passable, and the first party, including Brigham Young, arrived in the valley of 
the Great Salt Lake in July of 1847, founding Great Salt Lake City. Mormon 
colonists were soon sent throughout the mountain west. They called their new 
land Deseret, a word from The Book of Mormon meaning honey bee. In response 
to Mormon petitions to found a State of Deseret, Congress established instead 
a Territory of Utah, naming it after the local Ute Indians. In spite of this 
nominal rebuff, Brigham Young was appointed Governor, and the name Deseret 
would be applied to a newspaper, a bank, a university, numerous businesses and 
associations, and even a spelling-reform alphabet. The name Deseret, and the 
beehive symbol, remain common and largely secularized in Utah today. 



3 The History of the Deseret Alphabet 

3.1 Deliberations: 1850-1853 

Education has always been a high priority for the Mormons, and on 13 March 
1850 the Deseret University, now the University of Utah, was established under 
a Chancellor and Board of Regents that included the leading men of the new 
society. Actual teaching would not begin for several years, and the first task 
given to the Regents was to design and implement a spelling reform. 

⁷ The Latter-day Saints’ Millennial Star, vol. 11, 1847, p. 8. 

⁸ The Phonotypic Journal, vol. 3, no. 35, Oct. 1844. 
Fig. 4. Alphabet No. 15 appeared in October 1844 and was the first of Pitman’s 
“lowercase” or “small letter” alphabets, employing separate glyphs for uppercase and 
lowercase letters. 

Although serious discussion of spelling reform began in 1850, I will jump 
ahead to 1853, when the Regency met regularly in a series of well-documented 
meetings leading to the adoption of the Deseret Alphabet. Throughout that year, 
the Regents presented to each other numerous candidate orthographies ranging 
from completely new alphabets, to Pitman shorthand, to minimal reforms that 
used only the traditional 26-letter Roman alphabet with standardized use of 
digraphs. The discussion was wide open, but by November of 1853, it was clear 
that the “1847 Alphabet” (see Figure 5), a 40-letter version backed jointly by 
Isaac Pitman and phonetician Alexander J. Ellis [15], was the recommended 
model. The 1847 Alphabet was presented to the Board in a surviving chart (see 
Figure 6) and the meeting minutes were even being delivered by reporter George 
D. Watt in the longhand form of this alphabet. 
Fig. 5. The 1847 Alphabet of Alexander J. Ellis and Isaac Pitman as it appeared in 
Pitman’s 1850 Bible. This alphabet was the main phonemic model for the Deseret 
Alphabet in late 1853. The Board of Regents of the Deseret University almost adopted 
a slightly modified form of this alphabet, but they were persuaded, at the very last 
moment, to change to non-Roman glyphs. Compare the layout of this chart to that of 
the Deseret Alphabet charts in the books of 1868-69 (see Figure 17). 

Brigham Young, President of the Church of Jesus Christ of Latter-day Saints 
and Governor of the Territory of Utah, took a personal interest in the 1853 
meetings, attending many and participating actively. On the 22nd and 23rd of 
November, he and the Regents adopted their own modified version of the 1847 
Alphabet, with some of the glyphs modified or switched, and names for the 
letters were adopted. A couple of Pitman letters were simply voted out, namely 
those for the diphthongs /ɔɪ/ and /ɪu/, which are exemplified with the words 
oyster and use in the 1847 chart. The result was a 38-letter alphabet, still very 
Fig. 6. In November 1853, Parley P. Pratt presented “Pitman’s Alphabet in Small 
Letters” to the Board of Regents in the form of this chart. These are in fact just the 
lowercase letters of the famous 1847 Alphabet devised by Isaac Pitman and Alexander 
J. Ellis. More stable than Pitman’s other alphabets, it lasted several years and was used 
to print a short-lived newspaper called The Phonetic News (1849), the Bible (1850), 
and other books. LDS Church Archives. 

Pitmanesque and Romanic. For the second time - the first was in 1847 - the 
Mormons were about to embark on a Pitman-based spelling reform. 

However, all plans were turned upside-down by the sudden arrival of Willard 
Richards at the meeting of 29 November 1853. Richards, who was Second Coun- 
selor to Brigham Young, was gravely ill, had not attended the previous meetings, 
and was not up to date on the Board’s plans. But when he saw the Board’s new 
Romanic alphabet on the wall, he could not contain his disappointment. The 
following excerpts, shown here in equivalent IPA to give the flavor of George D. 
Watt’s original minutes, speak for themselves: 

wi wont e nju ka%d dv aelfaebet, diferiq from Si kompa'^nd mes dv 
stAf Apon Saet Jit.... Soz kasraekterz me bi emploid m impruvig Si irighj 
orGograefi, So aet Si sem ta^m, it iz aez a^ haev SAmta^mz sed, it simz 
la^k pAtiq nju wa% mtu old botlz.... a^ aem mkla-^nd tu 0iqk hwen wi 
haev riflekted loqer wi Jael stil mek SAm aedvaens Apon Saet aelfaebet, 
aend prhaeps 0ro aewe d 1 kaeraekterz Saet ber mAtf rizemblens tu Si iriglij 
kaeraekters, aend mtrodjus aen aelfaebet Saet iz oric^mael, so far aez wi no, 
aen aelfaebet entaTli diferent from em aelfaebet m jus⁹. 

⁹ “We want a new kind of alphabet, differing from the compound mess of stuff 
upon that sheet. . . . Those characters may be employed in improving the English 
orthography, though at the same time, it is as I have sometimes said, it seems like 
putting new wine into old bottles. ... I am inclined to think when we have reflected 
longer we shall still make some advance upon that alphabet, and perhaps throw away 
all characters that bear much resemblance to the English characters, and introduce 
an alphabet that is original, so far as we know, an alphabet entirely different from 
any alphabet in use.” 







Kenneth R. Beesley 



Some objections were tentatively raised. It was pointed out that the key 
committee had been instructed to keep as many of the traditional Roman letters 
as possible, and that Brigham Young himself had approved the alphabet and had 
already discussed ordering 200 pounds of type for it. Richards then attenuated 
his criticism a bit, but renewed his call for a complete redesign, waxing rhetorical: 

whnt haev ju gend ba-’ Si aelfaebet on Saet kard a-^ aesk ju. Jo mi wau 
aRem, kasn ju point a'^t Si ferst asdvaentec^ Saet ju haev gend over Si old 
wAn? ... hwot haev ju gend, ju haev Si sem old aelfaebet over aegen, onli a 
fju aedijnael marks, aend Se onli mistifa^ it mor, aend mor.¹⁰ 

Richards believed fervently that the old Roman letters varied too much in 
their values, that no one would ever agree on their fixed use, and that keeping 
them would just be a hindrance; a successful, lasting reform would require 
starting with a clean slate. He also argued for economy in writing time, paper 
and ink. These arguments anticipated those advanced by George Bernard Shaw 
in the 20th century to support the creation of what is now known as the Shaw 
or Shavian Alphabet [28, 18]¹¹. 

Brigham Young and the Board of Regents were persuaded, the Board’s 
modified Pitman alphabet was defenestrated, and the first version of a new 
non-Roman alphabet was adopted 22 December 1853, with 38 original glyphs 
devised by George D. Watt and perhaps also by a lesser-known figure named 
John Vance. The Deseret Alphabet was born. 



3.2 Early Deseret Alphabet: 1854-1855 

In Salt Lake City, the Deseret News announced the Alphabet to its readers 19 
January 1854: 

The Board of Regents, in company with the Governor and heads of 
departments, have adopted a new alphabet, consisting of 38 characters. 

The Board have held frequent sittings this winter, with the sanguine 
hope of simplifying the English language, and especially its Orthography. 
After many fruitless attempts to render the common alphabet of the day 
subservient to their purpose, they found it expedient to invent an entirely 
new and original set of characters. 

These characters are much more simple in their structure than the usual 
alphabetical characters; every superfluous mark supposable, is wholly 
excluded from them. 

The written and printed hand are substantially merged into one. 

¹⁰ “What have you gained by the alphabet on that card I ask you. Show me one item, 
can you point out the first advantage that you have gained over the old one? ... What 
have you gained, you have the same old alphabet over again, only a few additional 
marks, and they only mystify it more, and more.” 

¹¹ http://www.shavian.org/ 
Type of some kind, almost certainly wooden¹², was soon prepared in Salt Lake 
City, and on 24 March 1854 a four-page folded leaflet with a chart of the Deseret 
Alphabet was presented to the Board (see Figure 1). In this early 1854 version of 
the Alphabet, we find 38 letters, the canonical glyphs being drawn with a broad 
pen, with thick emphasis on the downstrokes, and light upstrokes and flourishes. 
The short-vowel glyphs are represented smaller than the others. 


Fig. 7. Extract from the minutes of a Bishops’ meeting, 6 June 1854, concerning the 
support of the poor. These minutes, written in a cursive, stenographic style, were 
prepared by George D. Watt and addressed directly to Brigham Young. LDS Church 
Archives. 



George D. Watt was the principal architect of the Deseret Alphabet and, 
judging by surviving documents, was also the first serious user. Watt was a 
Pitman stenographer, and the early documents (see Figure 7) are written in a 
distinctly stenographic style¹³. Watt drew the outline of each word cursively, 
without lifting the pen. Short vowels, shown smaller than the other glyphs in 
the chart, were incorporated into the linking strokes between the consonants; 
thus vowels were usually written on upstrokes, which explains their canonical 
thin strokes and shorter statures in the first chart. The writer had to go back 
and cross the t vowels after finishing the outline; and often short vowels were 
simply left out. 

The demands of cursive writing seem to have influenced the design of several 
of the letters. In particular, the fussy little loops on the 0 (/d/), 8 (/s/), © 
(/g/), 6 (/o/) and 8 (/aʊ/) were used to link these letters with their neighbors. 
Watt also combined consonants together with virtuosity, “amalgamating” them 
together to save space, but at the expense of legibility. Another lamentable 

¹² Deseret News, 15 August 1855. 

¹³ James Henry Martineau was another early cursive writer. 



Fig. 8. Remy and Brenchley almost certainly copied this chart from an almost identical 
one in W.W. Phelps’ Deseret Almanac of 1855. With the addition of letters for /ɔɪ/ and 
/ɪu/, this 40-letter version of the Deseret Alphabet had the same phonemic inventory 
as the Pitman-Ellis 1847 Alphabet. 



characteristic of the early style was the inconsistent use of letters, sometimes to 
represent their phonemic value and sometimes to represent their conventional 
name. Thus Watt writes people as the equivalent of /ppl/, expecting the reader 
to pronounce the first p-letter as /pi/, that being the letter’s conventional name 
when the alphabet is recited. Similarly, Watt can spell being as the equivalent 
of just /bŋ/, the letters having names pronounced /bi/ and /iŋ/, respectively. 
While probably seen by shorthand writers as a clever way to abbreviate and 
speed their writing, the confusion of letter names and letter values is a mistake 
in any book orthography. 
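
The ambiguity described above can be made concrete with a small sketch. It is only an illustration, not a reconstruction of Watt's practice: it assumes that the three letter names mentioned in the text (written here as pi, bi and iŋ) are the only ones in play, and it treats every other letter as standing for its phonemic value alone. The function name possible_readings is hypothetical.

```python
from itertools import product

# Letter names taken from the examples in the text (assumption: no other
# letter is ever read by its name in this sketch).
LETTER_NAMES = {"p": "pi", "b": "bi", "ŋ": "iŋ"}


def possible_readings(spelling):
    """Return every reading in which each letter is either its value or its name."""
    options = []
    for letter in spelling:
        choices = [letter]                  # the letter's phonemic value
        name = LETTER_NAMES.get(letter)
        if name is not None:
            choices.append(name)            # the letter's conventional name
        options.append(choices)
    return ["".join(combo) for combo in product(*options)]


people = possible_readings("ppl")   # Watt's abbreviated spelling of "people"
being = possible_readings("bŋ")     # Watt's abbreviated spelling of "being"
print(people)
print(being)
```

Even a three-letter outline yields several candidate readings, of which only one is the intended word; the reader must resolve the choice from context, which is acceptable in shorthand but not in a book orthography.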

Like Isaac Pitman, the Mormons could not resist experimenting with their 
new alphabet, changing both the inventory of letters and the glyphs. The 1854 
alphabet was almost immediately modified, substituting new glyphs for /i/ and 
/aʊ/ and adding two new letters for the diphthongs /ɔɪ/ and /ɪu/, making a 
40-letter alphabet as printed in the 1855 Deseret Almanac of W.W. Phelps. 
This chart was almost certainly the one copied by Remy and Brenchley [27] who 
visited Salt Lake City in 1855 (see Figure 8)¹⁴. 



¹⁴ For yet another chart of this version of the Alphabet, see Benn Pitman’s The 
Phonographic Magazine, 1856, pp. 102-103. 
Watt apparently believed that the same basic alphabet could serve for both 
stenography and everyday orthography, or as the Deseret News, cited above, 
put it, “The written and printed hand are substantially merged into one.” This 
was in fact an early goal of phonotypy, but it was soon abandoned by Pitman 
as impractical [15]. The retention of this old idea contributed to making the 
Deseret Alphabet an esthetic and typographical failure. 

One of the fundamental design problems in the Alphabet was the elimination 
of ascenders and descenders. This was done in a well-intentioned attempt to 
make the type last longer - type wears out during use, and the ascenders and 
descenders wear out first - but the lamentable result was that all typeset words 
have a roughly rectangular shape, and lines of Deseret printing become very 
monotonous. Some of the glyphs, in particular 8 and 0, are overly complicated; 
and in practice writers often confused the pairs 0 vs. © and 0 vs. D. These 
fundamental design problems need to be distinguished from the font-design 
problems, which will be discussed below. 



3.3 The 1857 St. Louis Font 

The reform was moving a bit slowly. On 4 February 1856 the Regents appointed 
George D. Watt, Wilford Woodruff, and Samuel W. Richards to prepare manu- 
scripts and arrange for the printing of books. The journals of Richards and 
Woodruff show that they went at it hammer and tongs, working on elementary 
readers and a catechism intended for teaching religious principles to children. 
The next step was to get a font made. 

There are references to an attempt, as early as 1855, to cut Deseret Alphabet 
punches right in Utah, by a “Brother Sabins”¹⁵, but there is as yet no evidence 
that this project succeeded. In 1857, Erastus Snow was sent to St. Louis to 
procure type, engaging the services of Ladew & Peer, which was the only foundry 
there at the time [31]. But Snow abandoned the type and hurried back to Utah 
when he discovered that President Buchanan had dispatched General Albert 
Sidney Johnston to Utah with 2500 troops from Fort Leavenworth, Kansas, to 
put down a reported Mormon rebellion and install a new non-Mormon governor. 
The news of “Johnston’s Army” reached Salt Lake City 24 July 1857, when the 
alleged rebels were gathered for a picnic in a local canyon to celebrate the tenth 
anniversary of their arrival in Utah. In the ensuing panic, Salt Lake City and the 
other northern settlements were abandoned, and 30,000 people packed up their 
wagons and moved at least 45 miles south to Provo. The territorial government, 
including records and the printing press, was moved all the way to Fillmore in 
central Utah. While this bizarre and costly fiasco, often called the Utah War or 
Buchanan’s Blunder, was eventually resolved peacefully, it was another setback 
to the Deseret Alphabet movement. 

By late 1858, the Utah War was over, the St. Louis type had arrived in 
Salt Lake City, and work recommenced. It is very likely that only the punches 

¹⁵ The Latter-day Saints’ Millennial Star, 10 November 1855. The reference is probably 
to John Sabin (not Sabins), who was a general mechanic and machinist. 
and matrices were shipped to Utah¹⁶, and that the Mormons did the actual type 
casting themselves. The children’s texts prepared by the committee of Woodruff, 
Richards and Watt had been lost; unfazed, Brigham Young told Woodruff to 
“take hold with Geo. D. Watt and get up some more.”¹⁷ The first use of the new 
type was to print a business card for George A. Smith, the Church Historian. 
The stage was now set for the revival of the Deseret Alphabet reform in 1859-60. 

3.4 The Revival of 1859-1860 

Sample Articles Printed in the Deseret News. The period of 1859-60 
was a busy and productive one for the Deseret Alphabet. The type was finally 
available, and on 16 February 1859 the Deseret News printed a sample text from 
the Fifth Chapter of Matthew, the Sermon on the Mount. Similar practice texts, 
almost all of them scriptural, appeared almost every week to May 1860. Despite 
this progress, everyone involved was extremely disappointed with the St. Louis 
font, which was crudely cut and ugly by any standards. Young felt that the poor 
type did as much as anything to hold back the reform. 

The 1859 Alphabet as printed in the Deseret News (see Figure 9) had reverted 
to 38 letters, lacking dedicated letters for the diphthongs /ɔɪ/ and /ɪu/, which 
had to be printed with digraphs; but the Deseret News apologized for the lack of 
a /ɪu/ letter and promised a correction as soon as a new punch could be cut¹⁸. 

In 2002 I found the punches for the 1857 St. Louis font in the LDS Church 
Archives (see Figure 10). There proved to be only 36 punches in each of three 
sizes, but investigation showed that they were originally intended to support a 
40-letter version of the Alphabet. The trick was the double use of four of the 
punches, rotating them 180 degrees to strike a second matrix. Thus the punch 
for 1 also served to strike the matrix for L; the punch for ”1 also served for L; and 
similarly for the pairs 1-f and J-1. The sets include punches for the /ɔɪ/ and 
/ɪu/ diphthongs, being 6 and ®, respectively, but these glyphs had apparently 
fallen out of favor by 1859 and were not used in the Deseret News. 
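
The punch arithmetic can be checked with a short sketch. The figures (36 punches per size, three sizes, four punches struck a second time after a 180-degree rotation) come from the paragraph above; the constant and function names are hypothetical and serve only this illustration.

```python
# Figures taken from the text: 36 punches per size, three sizes, and four
# punches that were each struck twice (the second time rotated 180 degrees).
PUNCHES_PER_SIZE = 36
SIZES = 3
DOUBLE_USE_PUNCHES = 4


def matrices_per_size(punches: int, double_use: int) -> int:
    """Each punch yields one matrix; a double-use punch yields one more."""
    if double_use > punches:
        raise ValueError("cannot reuse more punches than exist")
    return punches + double_use


letters = matrices_per_size(PUNCHES_PER_SIZE, DOUBLE_USE_PUNCHES)
total_punches = PUNCHES_PER_SIZE * SIZES
print(letters)        # 40
print(total_punches)  # 108
```

Under these figures, 36 punches are enough to strike matrices for a 40-letter alphabet in each size.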

Handwritten Deseret Alphabet in 1859-60. Brigham Young directed his 
clerks to use the Alphabet, and the history or biography of Brigham Young was 
kept in Deseret Alphabet at this time. Another surviving text from this period 
is the financial “Ledger C”, now held at Utah State University (see Figure 12). 
This ledger was probably kept by clerk T.W. Ellerbeck who later wrote [19], 
“During one whole year the ledger accounts of President Young were kept by 
me in those characters, exclusively, except that the figures of the old style were 
used, not having been changed.” 

The Ledger C alphabet has 39 letters, including the glyph 9 for /ɪu/ but 
using a digraph for /ɔɪ/. The Ledger abandons the Alphabet in May of 1860, at 

¹⁶ Deseret News, 16 February 1859. 

¹⁷ Journal History, 20 November 1858. The journal of Wilford Woodruff for 22 November 
1858 indicates that the manuscripts were soon found. 

¹⁸ The Deseret News also promised a new letter for the vowel in air, which was a highly 
suspect distinction made in some Pitman alphabets. 

Fig. 9. The Deseret News printed sample articles in the Deseret Alphabet in 1859-60, 
and again in 1864, using the crude St. Louis type of 1857. This article, of which only 
a portion is shown here, appeared in the issue of 30 November 1864, vol. XIV, no. 9, 
which also included reports of the fall of Atlanta, Georgia to General Sherman during 
the American Civil War. 



the same time that the Deseret News stopped printing sample articles, and the 
Deseret text was at some point given interlinear glosses in standard orthography. 

My own favorite document from this era is the Deseret Alphabet journal of 
Thales Hastings Haskell [29], kept from 4 October 1859 to the end of that year 
while he and Marion Jackson Shelton¹⁹ were serving as Mormon missionaries 



¹⁹ Shelton also kept a journal in 1858-59, in a mix of standard orthography, Pitman 
shorthand, and some Deseret Alphabet. LDS Church Archives. 



Fig. 10. Some smoke proofs of the 1857 St. Louis punches, found in 2002 in the LDS 
Church Archives. The 0 and ffi glyphs, representing /ɔɪ/ and /ɪu/, respectively, were 
not used when the Deseret News finally started printing sample articles with the type 
in 1859. 








Fig. 11. A portion of the errata sheet, printed in Utah using the St. Louis type of 
1857, for The Deseret First Book of 1868. A much better font was cut for printing the 
readers (see Figure 15), but it was left in the care of Russell Bros. in New York City. 



to the Hopi²⁰. They were staying in the Third-Mesa village of Orayvi (also 
spelled Oribe, Oraibi, etc.), now celebrated as the oldest continuously inhabited 
village in North America. Haskell used a 40-letter version of the alphabet, like the 
contemporary Deseret News version, but adding ? for /ɪu/ and, idiosyncratically, 
ffi for /ɔɪ/. The original manuscript is faint and fragile; the following is a sample 
in typeset Deseret Alphabet and equivalent IPA: 



²⁰ The original journal is in Special Collections at the Brigham Young University 
Library. At some unknown time after the mission, Haskell himself transcribed 
the Deseret text into standard orthography, and this transcription was edited and 
published by Juanita Brooks in 1944 [9]. 














Fig. 12. Ledger C of 1859-60, probably kept by T.W. Ellerbeck, illustrates an id- 
iosyncratic 39-letter version of the Deseret Alphabet. There are still cursive links and 
“amalgamations” in this text, but far fewer than in George D. Watt’s early texts of 
1854-55. Interlinear glosses in traditional orthography were added later. Utah State 
University. 



[Sample in typeset Deseret Alphabet; illegible in this copy.] 

got Ap tuk bekfAst [sic] send statid indiAn went aehed tu oraJh vili<^ 
tu tel 3em Sast wi waer kAmi:g 

In standard orthography, this reads, “Got up, took b[r]eakfast and sta[r]ted [;] 
Indian went ahead to Oribe village to tell them that we were coming.” The 
missing r in breakfast is just an isolated error, but the spelling of /statid/ for 
started is characteristic; Haskell was from North New Salem, Franklin County, 
Massachusetts, and he dropped his rs after the /a/ vowel [5]. Other writers 
similarly leave clues to their accents in the phonemically written texts. 

Marion J. Shelton was a typical 40-letter Deseret writer from this period, 
using the more or less standard glyphs 0 for /ɔɪ/ and 9 for /ɪu/, as in the 
following letter²¹, written shortly after his arrival in Orayvi. 



[Text of the letter in typeset Deseret Alphabet; illegible in this copy.] 



²¹ Marion Jackson Shelton to George A. Smith and others, 13 November 1859, George 
A. Smith Incoming Correspondence, LDS Church Archives. 
[Deseret Alphabet text of the letter, continued; illegible in this copy.] 

Here is the same letter transcribed into traditional orthography. 

Oribe Village New Mexico. 

Nov. 13. 1859. 

Beloved Brothers, 

I am sitting on top of my dwelling writing. The way 
we get into our house is through a little square hole in the top. We go down a 
ladder and when we get down we have to stand stooping or bump our heads. 
Yesterday morning I took breakfast with one of my red friends. I went up into 
the third story and seated myself on the floor beside my friend. The lady of 
the house brought a “chahkahpta” [tsaqapta], or earthen jar, full of soup, and 
a basket full of “peek”, (a bread resembling blue wrapping paper folded) The 
old lady seated herself, a little boy also and lastly the cat to its place with its 
head in the soup and its tail on the peek. So we broke peek dipped soup with 
our fingers and had a merry breakfast. 

These Oribes beat the Mormons for children, a few dogs and cats and horses, 
a good many sheep, turkeys, and chickens with lots of peaches, corn, beans, 
melons, and pepper, squashes &c. These things they raise. 

Their workshops [kivas] are underground. Their work is chiefly making blan- 
kets and belts of wool - they raise some cotton, and are not addicted to begging, 
but are very intelligent and industrious indians. 

I write jokingly but truefully [sic]. But, brothers, I shall see you next fall and 
will have learned more about these folks by that time and then we’ll have big 
talks together. Yours, 

M.J. Shelton. 



To G.L.[sic] Smith, R. Bentley, R. Campbell, J.J, J.V. and others. 

Over the years, Shelton proposed a number of modifications to the Deseret 
Alphabet, including the addition of the letter I; its use in the following text²² 
shows that it was intended to represent the schwa, or neutral vowel, a phoneme 
missing from the standard Deseret Alphabet and from the 1847 Ellis-Pitman 
Alphabet that was its principal model. 

²² Marion J. Shelton to Brigham Young, 3 April 1860, Brigham Young Incoming 
Correspondence, LDS Church Archives. 
[Text of the letter in typeset Deseret Alphabet, using Shelton's additional letter; illegible in this copy.] 

This experiment, which never caught on, resulted in a 41-letter Deseret 
Alphabet. The text in equivalent phonemic IPA is the following: 

a-^ did not SAksid m lArniq 3em tu raT sez 3 daensiq komenst Jortli aftor a'^r 
ora-’vol 3er send kontmjad Antil wi left. bAt a^ aem saetisfaM 3aet wi3 propor kardz 
aj ksen Arn 3em tu raT m wau wintar mor. a-^ haev 3 a^s tolarabh wel brokan.^^ 



3.5 The 1860s and the Printed Books 

Most of the enthusiasm for the Deseret Alphabet collapsed in 1860, and by 
1862 it was dead, except in the determined mind of Brigham Young. When 
Superintendent of Common Schools Robert L. Campbell presented Brigham 
Young with a manuscript of a “first Reader” in standard orthography, Young
rejected it emphatically, insisting that “he would not consent to have his type,
ink or paper used to print such trash”.

In 1864, the Regents considered adopting the phonotypy of Benn Pitman, 
the brother of Isaac who had established his own Phonographic Institute in 
Cincinnati in 1853, but the ultimate response was a recommitment to the Deseret 
Alphabet; sample Alphabet articles reappeared defiantly in the Deseret News 11 
May 1864 and continued to the end of the year. 

There were in fact several attempts during the 1860s to abandon the Deseret 
Alphabet. In December of 1867^^, the Board of Regents, with Brigham Young, 
resolved unanimously to adopt “the phonetic characters employed by Ben [sic] 
Pitman of Cincinnati, for printing purposes, thereby gaining the advantage of 
the books already printed in those phonetic characters.” However, on 3 February 
1868^®, the Board once again did an about-face, recommitted to the Deseret 
Alphabet and started the serious and expensive work of getting books prepared 
for publication. Apostle Orson Pratt was hired to transcribe The Deseret First 
Book and The Deseret Second Book into the Deseret Alphabet. 

After the disappointment with the crude St. Louis type, the Regents in 1868 
sent their agent D.O. Calder to New York to get better workmanship. Calder 
engaged the firm of Russell Bros^^, which cut and cast an English (14-point) font 
for the project. The new school books (see Figures 14 and 15) were delivered to 

“I did not succeed in learning them to write as the dancing commenced shortly after 
our arrival there and continued until we left, but I am satisfied that with proper
cards I can learn them to write in one winter more. I have the ice tolerably well 
broken.” 

Journal History, 22 May 1862. 

Deseret News, 19 December 1867. 

Deseret News, 3 February 1868. 

Russell’s American Steam Printing House, located at 28, 30 and 32 Centre Street, 
New York City, Joseph and Theodore Russell, Props. 
[Figure: alphabet chart headed “Ov de Kapitalz and Smel Leterz”, pairing the capital and small form of each letter.]

Fig. 13. The 1855 Benn Pitman or American Pitman Alphabet. In 1852, Benn Pitman 
carried the Pitman phonography and phonotypy movement to the United States, 
setting up The Phonographic Institute in Cincinnati in 1853. Whereas Isaac Pitman 
was an incurable tinkerer, constantly modifying his alphabets, brother Benn recognized 
the virtues of stability. 



Salt Lake City in late 1868, at which time Orson Pratt had already turned his 
dogged energy to the transcription of The Book of Mormon. 

In 1869, Pratt was sent as the agent to New York, to supervise the printing
of The Book of Mormon. He too chose Russell Bros. and had a font of Long
Primer (10-point) type cut and cast for the body of the text^28. The bicameral

Small Pica (11-point) type was also considered and, unfortunately, rejected. With 
the inherent design problems of the Deseret Alphabet, the Long Primer type is too 
small for comfortable reading. 
Fig. 14. In 1868, The Deseret First Book, shown here, and The Deseret Second Book
were printed by Russell Bros. of New York and shipped to the Territory of Utah. The
print run for each book was 10,000 copies.



nature of the Deseret Alphabet allowed him to save some money by using the 
lowercase letters of the existing English (14-point) font as the uppercase letters 
of the Long Primer (10-point) font. Pratt also had fonts prepared in the Great 
Primer (18-point) and Double English (28-point) sizes to serve in headings and 
titles. Not surprisingly, Pratt complained that the three unlucky compositors 
assigned to the project were making “a great abundance of mistakes in setting 
[Figure: page 15 of the reader, with sections numbered XXIII and XXIV set in the Deseret Alphabet.]

Fig. 15. A page from The Deseret First Book. 



type”, and he had to give the proofs four good readings and supervise many 
corrections before the pages could be stereotyped^®. 

The Book of Mormon (see Figure 16) was published in two formats. The Book 
of Mormon Part I, intended to serve as an advanced reader, consisted of The 
First Book of Nephi, The Second Book of Nephi, The Book of Jacob, The Book 



29 



Orson Pratt to Robert L. Campbell, 12 June 1869, Deseret Alphabet Printing Files 
1869, LDS Church Archives. 
[Figure: title page of the Deseret Alphabet edition of The Book of Mormon, with a New York imprint.]



Fig. 16. In 1869, The Book of Mormon, a book of Mormon scripture, was published in 
two formats: the first third of the book, which cost 75 cents, and the full text, which 
cost $2. Part I had a print run of 8000 copies, and a good specimen today sells for 
perhaps $250 to $300. Only 500 copies of the full Book of Mormon were printed, and 
in 2004 an average copy sells for about $7000 or $8000. 



of Enos, The Book of Jarom, The Book of Omni and The Words of Mormon^'^. 
The entire Book of Mormon was also printed on better paper, and was more 
expensively bound. 



The Book of Mormon Part I is usually known, inaccurately, among used-book dealers 
as “The First Book of Nephi”. The Regents’ plan was eventually to offer the whole 
book in three parts, printing Parts II and III with proceeds from the sale of the first 
four books. 
Receipts from 1868 and 1869^31 show that the punches, matrices, type and
other printing paraphernalia remained the property of the Board of Regents
of the Deseret University, but they were left in the care of Russell Bros. in
expectation of future work, which in fact never materialized. Although a large 
collection of nineteenth-century punches survives at Columbia University in New 
York City, attempts to locate the Russell Bros. Deseret Alphabet punches have 
so far been unsuccessful. 

3.6 The 1868-69 “Book” Alphabet and Fonts

After the disappointing debut of the St. Louis type, used reluctantly to print 
sample articles in the Deseret News in 1859-60 and 1864, Brigham Young had 
vowed to go to England the next time to get better workmanship^^. But in fact 
in 1868 and 1869 the Mormons went only as far as New York City, engaging 
Russell Bros. to cut new punches, strike matrices, cast type, typeset and print
the books. 

This time they did get professional workmanship, but the resulting book font 
is still somewhat bizarre, partly because of the inherent awkwardness of the basic 
shapes, and partly because of choices in font design that now seem old-fashioned. 
A look at the book font (see Figures 15, 16 and 17) shows that the glyphs, 
compared to the earlier charts, have been Bodonified: made rigidly vertical, 
symmetrical wherever possible, and with extreme contrasts of thick and thin. 
Thom Hinckley, an expert typographer and printer (personal communication), 
has pointed out that the extreme thins of the font reveal the punch cutter as a 
master; at the same time, these thin lines would have caused the type to wear 
quickly, which was one of the very problems the Regents were trying to avoid; 
printing the extreme thins also required the use of unusually high-quality paper. 
The 38 glyphs of the 1868-69 book font were basically the same as the 38 glyphs 
used in printing articles in the Deseret News in 1859-60 and 1864; the only 
significant difference was that the old 3 glyph was mirror-imaged to £. 

Nash [21, pp. 23-29] lays out in devastating detail how the Deseret Alphabet 
type violates principle after principle of good book type, including the catas- 
trophic lack of ascenders and descenders. In the words of printing historian 
Roby Wentz [31], “The result was a very monotonous-looking line of type.” 
Hinckley has emphasized the problems of “weight” and “color” in the book font, 
resulting from the extreme contrast of thicks and thins and the uniformly thin 
short vowels. 

I believe that the problems of weight and color, including the thin repre- 
sentation of short vowels, the fussy loops that overcomplicate some glyphs, 
and the overall inharmonious collection of glyphs, go all the way back to the 
original amateur conception of the Deseret Alphabet as being suitable for both 
shorthand and everyday orthography. It was awkward enough as shorthand, 
and the translation to type was a failure that no amount of good type design 

Deseret Alphabet Printing Files 1868 and 1869, LDS Church Archives. 

Journal History, 16 February 1859. 
[Figure: chart listing each letter with its name and a keyword for its sound, beginning with the long sounds and the corresponding short sounds.]

Fig. 17. The 1868-69 Book Version of the Deseret Alphabet consisted of 38 letters, with 
uppercase and lowercase characters distinguished only by size. Aside from the strange 
glyphs, the inventory, grouping and alphabetical order of the Alphabet are based solidly 
on the 1847 Alphabet of Alexander J. Ellis and Isaac Pitman (see Figure 5). 



can really cure. One need only compare the Deseret Alphabet to the Shavian 
Alphabet (see Figures 19 and 18) to see the difference between an amateur and 
a professional design. 
[Figure: extract of text set in Shavian script.]

Fig. 18. In this extract of Shavian script, the title is set in the Ghoti (pronounced 
“fish”) font, and the body in the Androcles font, both by Ross DeMeyere. Copyright 
© 2002 DeMeyere Design Incorporated. All rights reserved. Reproduced by permission. 



3.7 The 1870s: Decline and Fall 

The Deseret First Book and The Deseret Second Book had print runs of 10,000 
copies each and sold for 15 and 20 cents, respectively. The first third (in actual 
quantity about a fourth) of the Book of Mormon, intended as an advanced reader, 
had a print run of 8,000 copies, and sold for 75 cents. Only 500 copies of the full 
Book of Mormon were printed, and they sold for $2. Or more to the point, the 
books did not sell. 

By the mid 1870s, the Deseret Alphabet was recognized as a failure even by 
Brigham Young. The bottom line was that books were expensive to produce, 
and not even loyal Mormons could be persuaded to buy and study them. On 2 
October 1875 The Juvenile Instructor, a magazine for Mormon youth, laid the 
Deseret Alphabet to rest. 

The Book of Mormon has been printed in the Deseret Alphabet, but 
President Young has decided that they are not so well adapted for the 
purpose designed as it was hoped they would be. There being no shanks 
[ascenders or descenders] to the letters, all being very even, they are 
trying to the eye, because of their uniformity. Another objection some 
have urged against them has been that they are entirely new, and we 
should have characters as far as possible with which we are familiar: and 
they have felt that we should use them as far as they go and adopt new 
characters only for the sounds which our present letters do not represent. 
There is a system known as the [Benn] Pitman system of phonetics 
which possesses the advantages alluded to. Mr. Pitman has used all the 
letters of the alphabet as far as possible and has added seventeen new 
characters to them, making an alphabet of forty-three letters. The Bible, 
a dictionary and a number of other works, school books, etc., have been 
printed in these new characters, and it is found that a person familiar 
with our present method of reading can learn in a few minutes to read 
those works printed after this system. We think it altogether likely that 
the regents of the University will upon further examination adopt this 
system for use in this Territory. 
[Figure: chart of the Shavian letters, each paired with a keyword (peep, bib, tot, dead, kick, ...), with a note that proper names are marked with a “namer” dot.]



Fig. 19. The Shaw or “Shavian” Alphabet was designed by typographer Kingsley Read 
and has inspired a number of other professional typographers, including Ross DeMeyere 
(http://www.demeyere.com/shavian/). The glyphs are simple and harmonious; as- 
cenders and descenders give words distinctive shapes and avoid monotony. Copyright 
© 2002 DeMeyere Design Incorporated. All rights reserved. Reproduced by permission. 
So while the Deseret Alphabet was dead, the Mormons hadn’t yet given up on
spelling reform. In July of 1877, Orson Pratt was sent to Liverpool to arrange to 
have The Book of Mormon and The Book of Doctrine and Covenants, another 
book of Mormon scripture, printed in the Benn Pitman orthography, “with the 
exception of two or three characters’’^^. But in August of that year, after most 
of the specially ordered phonotype had arrived from London, Brigham Young 
died; Orson Pratt was called back home, and the Mormons never dabbled in 
orthographical reform again. 

It has been written, and repeated numerous times, that “the Deseret Al- 
phabet died with Brigham Young”; however, the Deseret Alphabet had already 
been dead for at least a couple of years, and what died with Brigham Young was 
a very serious project, well in progress, to print Mormon scripture in a slight 
modification of Benn Pitman’s “American phonotypy”.



4 The Deseret Alphabet in Unicode 

4.1 The Character Inventory and Glyphs 

The Deseret Alphabet was first added to the Unicode 3.1 standard^34 in 2001, in
the surrogate space 10400-1044F, mostly through the efforts of John H. Jenkins
of Apple Computer^35. It holds some distinction as the first script proposed for
the surrogate space; as Jenkins describes it, “Nobody started to implement
surrogates because there were no characters using them, and nobody wanted their
characters to be encoded using surrogates because nobody was implementing
them”^36. The Deseret Alphabet, being a real but pretty dead script, was chosen
as a pioneer - or sacrificial lamb - to break the vicious circle.
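
To make the term “surrogate space” concrete, the following minimal Python sketch shows how a code point in the range 10400-1044F is split into a pair of 16-bit surrogate code units under UTF-16. The function name to_surrogates is an illustrative assumption, not something taken from the Unicode documents cited here; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)         # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits select the low surrogate
    return high, low


# First and last code points of the Deseret range cited in the text.
for cp in (0x10400, 0x1044F):
    high, low = to_surrogates(cp)
    print(f"U+{cp:05X} -> {high:04X} {low:04X}")
```

Every Deseret letter therefore occupies two 16-bit units in UTF-16, which is why software that had never implemented surrogates could not handle the script at all.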

The Unicode 3.1 encoding handled only the 38-letter version of the Deseret
Alphabet (this made 76 characters, including uppercase and lowercase) used in
the printed books of 1868-69. The implementors were honestly unaware that
earlier 39- and 40-letter versions of the Alphabet had been seriously used, and
so might need to be encoded. I later argued vigorously^37 for the addition of the
/ɔɪ/ and /ju/ letters used in several earlier versions of the Alphabet, including
the one used in the Haskell journal and Shelton letters that I have transcribed.
John Jenkins backed me up^38 and again deserves the credit for dealing with most
of the paperwork and bureaucracy.

The two new letters were included in Unicode 4.0, but unfortunately I could 
not persuade them to use the 1859-60 glyphs 0 and 9 as the citation glyphs; 
instead they went all the way back to the primitive glyphs of the 1854-55 charts. 
Unicode fonts based on the current heterogeneous collection of glyphs will be 
useless for any practical typesetting of 40-letter Deseret Alphabet documents. 
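
The resulting inventory can be inspected with Python's standard unicodedata module. This is only a sketch, assuming a Python build whose Unicode database is at version 4.0 or later (true of all current releases); the helper name deseret_inventory is hypothetical.

```python
import unicodedata

BLOCK_START, BLOCK_END = 0x10400, 0x1044F  # Deseret range cited in the text


def deseret_inventory():
    """Return (capitals, smalls) as lists of code points in the Deseret range."""
    capitals, smalls = [], []
    for cp in range(BLOCK_START, BLOCK_END + 1):
        category = unicodedata.category(chr(cp))
        if category == "Lu":      # uppercase letter
            capitals.append(cp)
        elif category == "Ll":    # lowercase letter
            smalls.append(cp)
    return capitals, smalls


capitals, smalls = deseret_inventory()
print(len(capitals), len(smalls))

# The two letters added in Unicode 4.0 are the last two capitals of the range.
for cp in capitals[-2:]:
    small = ord(chr(cp).lower())
    print(f"U+{cp:05X} {unicodedata.name(chr(cp))} -> U+{small:05X}")
```

With the two later additions the range holds 40 capitals and 40 small letters, so the 38-letter book alphabet accounts for 76 of the 80 characters.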

Journal of Discourses, vol. XIX, p. 112. 

http://www.unicode.org/ 

http://homepage.mac.com/jenkins/

http://homepage.mac.com/jenkins/Deseret/{Unicode.html,Computers.html}

Unicode discussion document N2474 2002-05-17.

Unicode discussion document N2473 2002-05-17. 
[Figure: Unicode code chart for the Deseret Alphabet, columns 1040 to 1044, code points 10400 to 1044F.]

Fig. 20. The Deseret Alphabet as it appears in Unicode 4.0. Copyright © 1991-2003 
Unicode, Inc. All rights reserved. Reproduced by permission of Unicode, Inc. 

4.2 Unicode Character Names 

The Unicode implementation of the Deseret Alphabet is also flawed by some 
changes to the letter names. Not to criticize anyone personally, but just for the 
record, there are several reasons why the name changes were ill-advised: 
Table 1. The Deseret Alphabet was added to Unicode by General American English
speakers who honestly misunderstood the 𐐉 (/ɒ/) and 𐐃 (/ɔ/) vowels, which have
collapsed to /ɑ/ in their dialect, and renamed them confusingly as SHORT AH and
LONG AH.

Char.   IPA    Original Name     Unicode Name
𐐀       /i/    e as in eat       LONG I
𐐁       /e/    a as in ate       LONG E
𐐂       /ɑ/    ah as in art      LONG A
𐐃       /ɔ/    aw as in aught    LONG AH
𐐄       /o/    o as in oat       LONG O
𐐅       /u/    oo as in ooze     LONG OO
𐐆       /ɪ/    i as in it        SHORT I
𐐇       /ɛ/    e as in et        SHORT E
𐐈       /æ/    a as in at        SHORT A
𐐉       /ɒ/    o as in ot        SHORT AH
𐐊       /ʌ/    u as in ut        SHORT O
𐐋       /ʊ/    oo as in book     SHORT OO



1. The Deseret Alphabet had a traditional set of letter names already estab- 
lished and available. Arbitrary changes in the names make it more difficult 
to compare the original charts and the Unicode charts. 

2. Some early Deseret Alphabet writers, including George D. Watt, consciously 
or unconsciously confused the traditional letter names and their phonological 
values. Some of their spellings make sense only if the letters are read with 
their original names. 

3. Some letter-name changes were made because the implementors simply did 
not hear and understand some of the vowel distinctions provided in the 
Deseret Alphabet; they were speakers of General American English, a dialect 
that has lost some of the vowel distinctions still present in English and New 
England dialects. 

The last point is the most unfortunate. Consider Table 1: The original name
for the Deseret 𐐂 letter, which is /ɑ/ in IPA, was “ah”, using a common
convention in English romanization whereby “ah” represents an unrounded low-
back vowel. Most English speakers use this vowel in the words father, bah and
hah. In England, and in much of New England, this vowel is distinct from the
first vowel in bother, represented in Deseret Alphabet as 𐐉 or in IPA as /ɒ/,
which is a rounded low-back vowel; thus for these speakers the words father
and bother do not rhyme. But the rounded /ɒ/ has collapsed into unrounded
/ɑ/ in General American English, so the words do rhyme for most Americans.
Similarly, the Deseret 𐐃 letter, IPA /ɔ/, represents a mid-low back rounded
vowel that has also collapsed into /ɑ/ for many American speakers. It can still
be heard quite distinctly in the speech of many New Yorkers, Philadelphians,
and New Englanders in general. The original Deseret name for the 𐐃, “aw”,
used a common convention for representing this rounded vowel, which occurs in
words like law, flaw, paw, aught, caught, etc. The equivalent letter in the Shaw







Alphabet is appropriately named AWE. Not understanding the phonological
distinctions involved, the implementors of Unicode renamed 𐐉 as SHORT AH
and 𐐃 as LONG AH, giving precisely the wrong clues to the pronunciation
of these rounded vowels. Unfortunately, Unicode policy values consistency over
accuracy, and it’s almost impossible to change character names once they have
been adopted.
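
The mismatch can be checked directly against the names recorded in the Unicode character database. The sketch below uses Python's standard unicodedata module; the dictionary ORIGINAL_NAMES and the helper unicode_letter_name are illustrative assumptions, with the original names copied from Table 1 and the code points taken to be those of the capital letters.

```python
import unicodedata

# Original letter names as listed in Table 1, keyed by capital-letter code point.
ORIGINAL_NAMES = {
    0x10402: "ah as in art",
    0x10403: "aw as in aught",
    0x10409: "o as in ot",
}


def unicode_letter_name(cp):
    """Return the Unicode character name without the script and case prefix."""
    full = unicodedata.name(chr(cp))
    return full.replace("DESERET CAPITAL LETTER ", "")


for cp, original in ORIGINAL_NAMES.items():
    print(f"U+{cp:05X}  {original:<16} {unicode_letter_name(cp)}")
```

Only the letter originally called “ah” is unrounded, yet it is the two rounded vowels that carry AH in their Unicode names.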



[Figure: sample of the glyphs and digits of the font.]

Fig. 21. Kearney’s Deseret font. 



5 Digital Fonts for the Deseret Alphabet 

5.1 Non-METAFONT Fonts

Kearney’s Deseret Font. A number of digital fonts have been designed for 
the Deseret Alphabet, most of them based on the 38-letter inventory and glyphs 
of the book font of 1868-69. The following is a very preliminary survey of fonts 
that I was able to find and test in early 2004^®. 

The prize for the first digital font would seem to go to Greg Kearney, whose 
Deseret font was created about 1991 using Fontographer. Kearney (personal 
communication) says that his font, now in the public domain, was created for 
the LDS Church History Department, now the LDS Church Archives, as a 
display font for an exhibit. 

I had difficulty testing this font^40 to input specific texts on my Mac OS X
system, but see Figure 21 for a sample of the glyphs as displayed by the FontBook 
application. 



Bateman’s Deseret Font. Edward Bateman, a graphic designer in Salt Lake 
City, scanned the Russell Bros. fonts from a copy of The Deseret Second Book,

The world of fonts, and especially amateur fonts, is woefully lacking in documenta- 
tion. I would be extremely grateful for corrections and additions to the information 
in this section. 

http://www.fontage.com/pages/deseret.html; http://funsite24.com/fo/d/
cleaned them up electronically using Fontographer, and created his font, also
called Deseret, in August 1995 [3]. The font came out of his graphics work on
the delightfully tongue-in-cheek 1995 science-fiction film Plan 10 from Outer
Space^41, with a plot that revolves around a mysterious plaque written by aliens
in the Deseret Alphabet. The font (see Figure 22) is still available from
Bateman^42, in both a TrueType version for Windows and a PostScript version for
Macintosh^43. He has plans (personal communication) to repackage the font on a
CD-ROM for modern Mac owners who no longer have a floppy-disk drive.

An unusual feature of the Bateman font is that it contains only lowercase 
letters, or perhaps only uppercase - you really can’t tell the difference in the 
Deseret Alphabet. This font is notable for reproducing the extreme contrast of 
thicks and thins seen in the original Russell Bros. font. 



Jenkins’ Zarahemla and Sidon Fonts. John Jenkins of Apple has created two 
fonts. The first, named Zarahemla, was created about 1995, originally using
Fontographer (personal communication). Jenkins scanned the 1868-69 Russell Bros.
glyphs, traced them, and cleaned them up digitally. This font is still available 
stand-alone and was part of Jenkins’ DLK^'^ (Deseret Language Kit) for typing 
Deseret Alphabet in Apple operating systems up to OS 9. The Zarahemla glyphs 
(see Figure 23) are now included in the Apple Symbols font distributed with 
OS X. Real Unicode Deseret Alphabet text can be typed using the Character 
Palette or the Unicode Hex Keyboard. 

A second Jenkins font, called Sidon, was created about 1999, originally using
METAFONT, with the glyphs later copied into FontLab. “The idea was to have
a Deseret Alphabet font which was not intended to just slavishly copy what
the Church did in the 1860s.” Sidon is not yet available stand-alone, but the
glyphs (see Figure 24) are now incorporated into the Apple Simple font used to
demonstrate the Apple Font Tools^45.



Brion Zion’s Beehive Font. A certain Brion Zion (perhaps a pseudonym) at 
some point created a font named Beehive. As far as I can tell, it is no longer 
available, and numerous Internet links to Zion pages are dead. A webpage^46
dedicated to Deseret Alphabet fonts is a virtual cemetery of dead links. 



Kass’s Code2001 Font. The freely available Code2001 font^^ by James Kass is 
a Plane 1 Unicode-based font, providing glyphs for the characters in the surrogate 
space, including Old Persian Cuneiform, Deseret, Tengwar, Cirth, Old Italic, 

http://www.cc.utah.edu/~th3597/kolobl.htm

http://www.xmission.com/~capteddy/

Macintosh OS X can now handle Windows TrueType fonts.

http://homepage.mac.com/jenkins/Deseret/

http://fonts.apple.com/

http://cgm.cs.mcgill.ca/~luc/deseret.html

http://home.att.net/~jameskass/code2001.htm
[Figure: sample text set in the font.]

Fig. 22. Bateman’s Deseret font. 

[Figure: sample text set in the font.]

Fig. 23. Jenkins’ Zarahemla font. 

[Figure: sample text set in the font.]



Fig. 24. Jenkins’ Sidon font. 
[Figure: sample text set in the font.]

Fig. 25. Kass’s Code2001 font. 

Gothic, etc. Kass informs me that the glyphs (see Figure 25) were designed from 
scratch and resided originally in the Private Use Area of the Code2000 font until 
Deseret was officially accepted and assigned code points in the surrogate space. 



Thibeault’s Deseret and Bartok’s HuneyBee Fonts. Daniel Thibeault took
the Deseret Alphabet glyphs from the Code2001 font and transposed them into
the ANSI range to make yet another font named Deseret^48. Stephen Bartok’s
HuneyBee font^49 was created in September 2003 by rearranging the glyphs in
Thibeault’s Deseret font to effect a different keyboard layout (personal
communication). In both fonts the glyphs are ultimately from the Code2001 font,
already illustrated in Figure 25.
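
Text typed with a font of this kind is stored as ordinary Latin characters and only looks like Deseret while that font is selected. The sketch below shows one way such text could be converted to real Unicode Deseret characters. The key assignments in HYPOTHETICAL_LAYOUT are invented purely for illustration, since the actual layouts of these fonts are not documented here, and the function name legacy_to_unicode is likewise an assumption.

```python
# HYPOTHETICAL layout: three Latin keys mapped to the first three small
# Deseret letters. A real conversion would need the font's actual key table.
HYPOTHETICAL_LAYOUT = {
    "a": 0x10428,
    "b": 0x10429,
    "c": 0x1042A,
}

# str.translate expects a mapping from code points to replacement strings.
TO_UNICODE = {ord(key): chr(cp) for key, cp in HYPOTHETICAL_LAYOUT.items()}


def legacy_to_unicode(text):
    """Replace mapped Latin keys with Deseret characters; leave the rest as-is."""
    return text.translate(TO_UNICODE)


converted = legacy_to_unicode("abc d")
print([f"U+{ord(ch):05X}" for ch in converted])
```

Two fonts that differ only in keyboard layout would share the same glyphs and differ only in the contents of this table.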



Elzinga’s Brigham Font. Dirk Elzinga of the Department of Linguistics and 
English Language at Brigham Young University is working on a new font called 
Brigham (see Figure 26), using FontForge, that is largely mono-width but judi- 
ciously uses thinner strokes for the loops. 



Robertson’s Fonts. Graphic designer Christian Robertson is working on two
fonts, “trying to make the Deseret Alphabet look good in type” (personal
communication), which is quite a challenge. In his first font, Robertson is not afraid to
“take out some of the curly queues that really mucked things up”, to rethink the
representation of the short vowels, to add serifs, and even to introduce something 
like ascenders. The sample in Figure 27, kindly provided by Robertson, does not 
represent the latest version of his font, and the text is gibberish, but it illustrates 
his innovative approach. Robertson’s next font will be even more challenging, 
designed for typesetting the early cursive manuscripts from 1854-55. 

http://www.angelfire.com/pq/Urhixidur/Fonts/Fonts.html

http://home.earthlink.net/~slbartok/projects/fonts.htm

Fig. 26. Elzinga’s Brigham font. 


Fig. 27. Robertson’s experimental font. 


Fig. 28. Beesley’s desalph font. 



5.2 Beesley’s METAFONT desalph Font and Package

My own desalph font (see Figure 28) was created with METAFONT for the specific
purpose of typesetting 40-letter Deseret Alphabet manuscripts from 1859-60.
These documents were typically written with narrow nib pens, producing some
thick-thin distinction, so the coding relies heavily on METAFONT penstroke
commands. I took my inspiration from the pre-book charts of 1854-55, and from
real handwriting. The penstrokes follow the path used to draw the glyphs, giving
a hint of the original handwriting that is completely obscured in the Bodonified
book font of 1868-69.

The desalph font is made available in a desalph package, which can be used in a
LaTeX document much like the TIPA package. The input of Deseret Alphabet
characters can be done somewhat clumsily using commands like \dalclongi
(Deseret Alphabet lowercase long i) for 𐐨 or \dauclongi (Deseret Alphabet
uppercase long i) for 𐐀. Inside \textda{} commands, a more convenient system
of transliteration “shortcuts” can be used. As I was already somewhat
comfortable with the shortcuts of the TIPA package, for entering IPA letters, I
laid out the desalph font internally so that the same shortcuts could be used
wherever possible. Simple commands were defined to enter diphthongs and
affricates, which have no shortcuts in TIPA. A simply defined \ipa{} command
allows the same commands to be used to enter equivalent IPA diphthongs and
affricates. The principal entry commands are summarized in Table 2, and some
extra commands for unusual and idiosyncratic glyphs are shown in Table 3.
Uppercase letters, found in Deseret Alphabet but not in IPA, can be entered
with corresponding uppercase “uc” commands with names like \dauclongi, or
by placing the shortcut in the \uc{} command, e.g. \uc{i}.

The use of METAFONT allowed me to define the proper glyphs for the 1859-60
manuscripts, especially the 𐑎 used for /ɔʲ/ and the 𐑏 used for /ʲu/, which I
have never seen in a printed chart or document. When I found a manuscript with
the experimental new letter for the neutral vowel called schwa (/ə/), making a
41-letter alphabet, adding it to my METAFONT font was a simple exercise.

The skeleton example in Figure 29 illustrates the use of the desalph and
tipa packages, and the definition of the \ipa{} command. This file yields the
following output:

A sample of Deseret Alphabet entered using shortcuts:
𐐨𐐩𐐪𐐫𐐬𐐭𐐮𐐯𐐰𐐱𐐲𐐳𐐴𐑎𐐵𐑏𐐶𐐷𐐸𐐹𐐺𐐻𐐼𐐽𐐾𐐿𐑀𐑁𐑂𐑃𐑄𐑅𐑆𐑇𐑈𐑉𐑊𐑋𐑌𐑍

Parallel phonemic IPA entered using the same shortcuts:
ieɑɔouɪɛæɒʌʊaʲɔʲaʷʲuwjhpbtdʧʤkgfvθðszʃʒrlmnŋ

6 Current and Future Projects 

6.1 The Deseret Alphabet and Native American Languages 

Although the Deseret Alphabet was intended for writing English, there was some 
hope and expectation that it could be used to transcribe other languages, that it 

http://tooyoo.l.u-tokyo.ac.jp/~fkr/

An 𐑎 punch appears in the set of St. Louis punches of 1857, but it was not used
when printing finally started in 1859.
Table 2. Commands from the desalph package to insert 1859-60 Deseret Alphabet
glyphs into running text, and shortcuts that can be used in desalph environments. The
single-letter shortcuts are parallel to the input transliteration for the TIPA package.
The commands defined for diphthongs and affricates can also be used inside \ipa{}
commands, allowing the same entry method to be used for both the Deseret Alphabet
and equivalent phonemic IPA.

Deseret   Command         Shortcut       IPA
𐐨         \dalclongi      i              i
𐐩         \dalclonge      e              e
𐐪         \dalclonga      A              ɑ
𐐫         \dalclongaw     O              ɔ
𐐬         \dalclongo      o              o
𐐭         \dalclongu      u              u
𐐮         \dalcshorti     I              ɪ
𐐯         \dalcshorte     E              ɛ
𐐰         \dalcshorta     \ae            æ
𐐱         \dalcshortaw    6              ɒ
𐐲         \dalcshorto     2              ʌ
𐐳         \dalcshortu     U              ʊ
𐐴         \dalcay         \aI or \aJ     aʲ
𐑎         \dalcoi         \OI or \OJ     ɔʲ
𐐵         \dalcow         \aU or \aW     aʷ
𐑏         \dalcyu         \ju or \Ju     ʲu
𐐶         \dalcwu         w              w
𐐷         \dalcye         j              j
𐐸         \dalch          h              h
𐐹         \dalcpee        p              p
𐐺         \dalcbee        b              b
𐐻         \dalctee        t              t
𐐼         \dalcdee        d              d
𐐽         \dalcchee       \tS            ʧ
𐐾         \dalcjee        \dZ            ʤ
𐐿         \dalckay        k              k
𐑀         \dalcgay        g              g
𐑁         \dalcef         f              f
𐑂         \dalcvee        v              v
𐑃         \dalceth        T              θ
𐑄         \dalcthee       D              ð
𐑅         \dalces         s              s
𐑆         \dalczee        z              z
𐑇         \dalcesh        S              ʃ
𐑈         \dalczhee       Z              ʒ
𐑉         \dalcer         r              r
𐑊         \dalcel         l              l
𐑋         \dalcem         m              m
𐑌         \dalcen         n              n
𐑍         \dalceng        N              ŋ
Table 3. Extra commands used to enter rare and idiosyncratic Deseret Alphabet
glyphs.

Command           Description
\daucslju         St. Louis 1857 font, unused glyph for /ʲu/
\dauchaskoi       Haskell’s idiosyncratic glyph for /ɔʲ/
\daucschwa        Shelton’s proposed glyph for schwa /ə/
\daucspellerow    Deseret Phonetic Speller glyph for /aʷ/



\documentclass[]{article}
\usepackage{times}
\usepackage{desalph}
\usepackage{tipa}

% commands used in \ipa{}, parallel to commands in \textda{}, to get
% an equivalent phonemic IPA transliteration of Deseret Alphabet
\newcommand{\ipa}[1]{{\tipaencoding%
\providecommand{\aI}{}\renewcommand{\aI}{a\textsuperscript{j}\xspace}%
\providecommand{\aJ}{}\renewcommand{\aJ}{a\textsuperscript{j}\xspace}%
\providecommand{\OI}{}\renewcommand{\OI}{O\textsuperscript{j}\xspace}%
\providecommand{\OJ}{}\renewcommand{\OJ}{O\textsuperscript{j}\xspace}%
\providecommand{\aU}{}\renewcommand{\aU}{a\textsuperscript{w}\xspace}%
\providecommand{\aW}{}\renewcommand{\aW}{a\textsuperscript{w}\xspace}%
\providecommand{\ju}{}\renewcommand{\ju}{\textsuperscript{j}u\xspace}%
\providecommand{\Ju}{}\renewcommand{\Ju}{\textsuperscript{j}u\xspace}%
\providecommand{\dZ}{}\renewcommand{\dZ}{\textdyoghlig\xspace}%
\providecommand{\tS}{}\renewcommand{\tS}{\textteshlig\xspace}#1}}

\begin{document}

\begin{center}

A sample of Deseret Alphabet entered using shortcuts:\\

\textda{i e A O o u I E \ae{} 6 2 U \aI{} \OI{} \aU{} \ju{}
w j h p b t d \tS{} \dZ{} k g f v T D s z S Z r l m n N}

\smallskip

Parallel phonemic IPA entered using the same shortcuts:\\

\ipa{i e A O o u I E \ae{} 6 2 U \aI{} \OI{} \aU{} \ju{}
w j h p b t d \tS{} \dZ{} k g f v T D s z S Z r l m n N}

\end{center}

\end{document}

Fig. 29. A skeleton example using the TIPA and desalph packages. 



could serve as a kind of international phonetic alphabet. The Deseret Alphabet
reform coincided with a period of intense Mormon interest in Native Americans,
and there is growing evidence that missionaries tried to use the Alphabet in
the field. For example, Isaac Bullock wrote a Shoshone vocabulary that includes

Parley P. Pratt to Orson Pratt, 30 January 1854, Orson Pratt Incoming
Correspondence, LDS Church Archives. Journal History, 4 June 1859.
Deseret Alphabet pronunciations for at least some of the Shoshone words. In
1859, Marion J. Shelton tried to teach the Deseret Alphabet to the Paiutes in
the area of Santa Clara, Utah, and there are hints that missionaries may have
tried to introduce Deseret-Alphabet-based literacy to the Navajo, the Zuñi, the
Creeks, and other tribes. Much research remains to be done in this area.

6.2 The Second Mormon Mission to the Hopi: 1859-60

In the last couple of years, it has become clear that there was a serious attempt 
to introduce Deseret-Alphabet-based literacy to the Hopi. In 1859, President 
Brigham Young personally chose Marion J. Shelton, instructed him to go to 
Hopi-land, stay a year, learn the language and try to “reduce their dialect
to a written language” using the Deseret Alphabet. This was the second of
fifteen early missions to the Hopi [23, 13, 14]. In December of 2002 I discovered 
an uncatalogued and unidentified “Indian Vocabulary” in the LDS Church 
Archives, and I was able to identify it as English-to-Hopi. I have argued [8] 
that it was written by Marion J. Shelton during this mission, and it appears to 
be the oldest written evidence of the Hopi language. 

The entire vocabulary has now been typed into an XML format, with fields
added for modern English and Hopi orthography, modern dictionary definitions,
and comments and references of various kinds. The XML file is downtranslated
using a Perl-language script, with the helpful Perl XML::Twig package, to
produce LaTeX source code with Deseret Alphabet output, using the desalph
package and font, and equivalent phonemic IPA output, using the TIPA package.
The use of XML, the desalph font, TIPA and LaTeX allows me and my
co-author Dirk Elzinga to reproduce this extraordinary document for study and
publication. Creating and maintaining the original data in an XML format gives
us all the advantages of XML validation and abstraction; and the flexibility
of downtranslation to LaTeX allows us to format the output in different ways
suitable for proofreading or for final publication.

The English-Hopi Vocabulary (see Figure 30) is written entirely in the
Deseret Alphabet and includes 486 entries like the following

𐑉𐐰𐐺𐐮𐐻-𐑅𐐻𐐮𐐿   𐐹𐐩𐐽.𐐿𐐬.𐐸𐐬

with an English word on the left and a Hopi word in Third Mesa (Orayvi) dialect
on the right. Encoded as XML, and with auxiliary information added, this entry
appears as shown in Figure 31. The XML file is validated using a Relax NG
schema. Downtranslation of the XML entry currently yields the LaTeX output in
Figure 32, which is a line in a table. When typeset, the entry appears as shown
in Table 4. This open tabular format is ideal for proofreading, and for the final
paper all that will be required is a modified Perl script to downtranslate the
same XML file into other LaTeX codes that waste less space.
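The downtranslation step can be illustrated with a minimal sketch. It is written in Python with the standard xml.etree.ElementTree module as a stand-in for the Perl XML::Twig script described above, which is not reproduced in this paper. The element names follow Figure 31 and the output layout follows Figure 32; the function name entry_to_latex, the way the entry number is passed in, and the abridged dictionary definition are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# One entry in the shape of Figure 31 (the <hd> definition is abridged here).
ENTRY_XML = r"""<entry>
<left>r\ae{}bIt-stIk</left>
<eng>rabbit stick</eng>
<right>pe\tS{}.ko.ho</right>
<hd pages="449">puts$|$koho `rabbit stick'</hd>
<mk></mk>
</entry>"""


def entry_to_latex(entry, number):
    """Render one <entry> element as a four-column LaTeX table row.

    `number` is the running entry number; how the real script obtains it
    is not shown in the paper, so it is passed in explicitly (assumption).
    """
    left = entry.findtext("left", default="").strip()
    eng = entry.findtext("eng", default="").strip()
    right = entry.findtext("right", default="").strip()
    hd = entry.find("hd")

    # Column 2: English gloss, then Deseret Alphabet, then phonemic IPA.
    english_cell = (
        "\\raggedright \\index{%s, %d} %s \\\\ \\textda{%s} \\\\ \\ipa{%s}"
        % (eng, number, eng, left, left)
    )
    # Column 3: Hopi word as phonemic IPA, then in the Deseret Alphabet.
    hopi_cell = "\\raggedright \\ipa{%s} \\\\ \\textda{%s}" % (right, right)
    # Column 4: modern dictionary reference, if present.
    dict_cell = ""
    if hd is not None and (hd.text or "").strip():
        definition = " ".join(hd.text.split())  # collapse line breaks
        dict_cell = "HD p.\\@ %s: %s" % (hd.get("pages", "?"), definition)

    return " & ".join([str(number), english_cell, hopi_cell, dict_cell]) + " \\\\"


row = entry_to_latex(ET.fromstring(ENTRY_XML), 340)
print(row)
```

Because the shortcuts inside <left> and <right> are already valid input for both \textda{} and \ipa{}, the script only has to wrap the same string in two different commands; no transliteration logic is needed at this stage.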

Glossary of Isaac Bullock, University of Utah Library, Special Collections. 

Brigham Young to Jacob Hamblin, 18 September 1859, Brigham Young Outgoing
Correspondence, LDS Church Archives.

http://www.xmltwig.com/xmltwig/

Fig. 30. A selection from the English-to-Hopi vocabulary showing parts of the entries
for words starting with /b/ and /t/ in English. The entry for bread, /brɛd/=/pik/
(𐐺𐑉𐐯𐐼=𐐹𐐨𐐿), is the second from the top on the left; the Hopi word is now written piiki.
The entry for boy, /bɔʲ/=/ti.o/ (𐐺𐑎=𐐻𐐨.𐐬), is the fourth from the top; the word is now
written tiyo. LDS Church Archives.



<entry>
<left>r\ae{}bIt-stIk</left>
<eng>rabbit stick</eng>
<right>pe\tS{}.ko.ho</right>
<hd pages="449">puts$|$koho ‘rabbit stick, a flat
boomerang-like stick used for hunting; used for throwing
and hitting it on the run’</hd>
<mk></mk>
</entry>

Fig. 31. An XML entry for the Hopi vocabulary. 



340 & \raggedright \index{rabbit stick, 340} rabbit stick \\
\textda{r\ae{}bIt-stIk} \\
\ipa{r\ae{}bIt-stIk} & \raggedright \ipa{pe\tS{}.ko.ho} \\
\textda{pe\tS{}.ko.ho} & HD p.\@ 449: puts$|$koho
‘rabbit stick, a flat boomerang-like stick
used for hunting; used for throwing and hitting
it on the run’ \\

Fig. 32. LaTeX output from downtranslating an XML entry.
Table 4. Entry of the English-Hopi vocabulary typeset for proofreading.

340   rabbit stick    peʧ.ko.ho     HD p. 449: puts|koho ‘rabbit stick, a
      𐑉𐐰𐐺𐐮𐐻-𐑅𐐻𐐮𐐿      𐐹𐐩𐐽.𐐿𐐬.𐐸𐐬     flat boomerang-like stick used for hunting;
      ræbɪt-stɪk                    used for throwing and hitting it on the run’





I have also transcribed the journal of Thales H. Haskell, kept in the Deseret 
Alphabet from October through December of 1859, and will include it in a 
general history of the second mission to the Hopi [7]. Here, for reading practice, 
is an extract from his journal in the original Deseret Alphabet and in equivalent 
phonemic IPA. Haskell idiosyncratically uses his own glyph for the /ɔʲ/ diphthong
instead of the 𐑎 glyph used by most other writers in 1859.

uj'ii OS'! i(D 91 UUP 1 +ji an 'lO uuee tjg ai'i ii ti 039 to 9 ua ps'i at duu i+aij+tM 3 
0+181918 P981 0 J 1 11 +J 01 ua baiiia .3. ee k tia 9J'i ee k 61li9 kd ai uiK i8 tia asta 9ri'i 
si?a 13C16 s?ii 01911^6 p+iaosos luosos ua lao jpi+ abi ua S90oi sm 3 fi9 ua fja si9 
OJ're+SSDr'r UIK 8+ baifir P+lWe K3 Jia+a KD Jir?® KJ9SUe6 6J+1 91C 

went da'^n tu ma-’ wulf traep bAt no wulvz haed ben tu it kem horn aend fa'^n 
br Jeltn pripaerig e kristmAS fist got it redi aend invaTid .3. ov d hed men ov 9 
viIk^ tu it wi3 AS haed bo^ld mAtn sPud pitfiz s-^uit dAmplmz fraMkeks paenkeks 
aend pik aeftr dmr wi smokt sa:q e him aend haed SAm knnvrsejAn wiS a'^r mdiAn 
frendz Se aepird tu em^o-i demselvz veri mAtf. 

6.3 Other Possible Deseret Alphabet Typesetting Projects 

Around 1985 the original Deseret Alphabet Book of Mormon was scanned and 
OCRed under the direction of Prof. John Robertson of the Brigham Young 
University Linguistics department, and the text was proofread by Kristen
McKendry. The surviving files from this project are not well organized, and
may not be complete, but it appears that the Deseret Alphabet Book of Mormon 
could now be reproduced without too much difficulty. As the original Book of 
Mormon had a print run of only 500 copies, and as a copy today can fetch up- 
wards of $7000 or $8000, there has always been some interest in retypesetting it. 

The Deseret First Book and The Deseret Second Book had print runs of 
10,000 copies each, are therefore much more plentiful, and copies today go for 
around $200. The Deseret First Book has even been reprinted photographically 
for sale to tourists as a Utah curiosity [26], and the text has been keyed in by 
John Jenkins, and proofread by Michael Everson and by myself. Such projects 
are of interest to linguists who want to search the texts electronically. 

In 1967, LDS Church archivists found a bundle of forgotten Deseret Alphabet 
manuscripts, some of them ready for the typesetter but never printed [32]. These 

This project, circa 1985-86, used a Kurzweil scanner, which was trained to recognize 
Deseret text. However, McKendry reports (personal communication) that the raw 
output of the OCR was so poor and the proofreading so onerous that it might have 
been easier just to type in the text manually. 
include The Doctrine and Covenants, with the Lectures on Faith; the Catechism
of John Jaques; and the entire text of the Bible. The LDS Church Archives also
hold the History of Brigham Young, a number of letters, an unfinished Deseret
Phonetic Speller, journals, and probably a number of other documents
still to be found.

7 Conclusion 

Although the Deseret Alphabet was never intended for secrecy [6], few people 
then or now can be persuaded to learn it, and a number of interesting documents 
have been ignored and unstudied for over 140 years. The letters and journals are 
of interest to historians, and the phonemically written texts are also of interest 
to linguists. With the help of XML, LaTeX, TIPA and new digital fonts for the
Deseret Alphabet, these neglected documents are coming to light again.
Abstract. METAPOST is able to produce figures that look almost like
ray-traced raster images but that remain vector-based. A small review
of three-dimensional perspective implementations with METAPOST is
presented. Special emphasis is given to the abilities of the author’s
implementation: FEATPOST.



1 Introduction 

There are at least four METAPOST packages related to three-dimensional
diagrams:

- GNU 3DLDF:
  http://www.gnu.org/directory/graphics/3D/3DLDF.html

- 3d/3dgeom:
  http://tug.org/tex-archive/graphics/metapost/macros/3d/

- m3D:
  http://www-math.univ-poitiers.fr/~phan/m3Dplain.html

- FEATPOST:
  http://matagalatlante.org/nobre/featpost/doc/featexamples.html

All of these packages are individual and independent works “under construction”.
There has been neither collaboration nor competition among the authors. Each
produces different kinds of diagrams and each uses a different graphic pipeline.
The following sections of this document describe these packages, in a mainly
independent way.



2 GNU 3DLDF 

3DLDF is not a pure METAPOST package, as it is written in C++ using CWEB.
Diagrams are also coded in C++ and are compiled together with the package.
Nevertheless, this is, of all four, the package with the greatest promise for a
future three-dimensional-capable METAPOST.

1. It outputs METAPOST.

2. Its syntax is similar to METAPOST.

3. It overcomes the arithmetic limitations inherent in METAPOST.

4. Both the affine transformations and the graphics pipeline are implemented
through 4x4 matrices.

5. Its author, Laurence D. Finston, is actively improving and maintaining the
package. His plan includes, among many other ideas, the development of an
input routine (to allow interactive use) and the implementation of
three-dimensional paths via NURBS.

Given the possible computational efficiency of this approach, one can foresee a
system that merges the METAPOST language with the capabilities of standard
ray-tracing software.



3 3d/3dgeom 

This was the first documented extension of METAPOST into the third dimension
- and also into the fourth dimension (time). Denis B. Roegel created, back in
1997, the 3d package to produce animations of polyhedra. In 2003 he added
the 3dgeom “module” which is focused on space geometry. It remains the least
computationally intensive package of those presented here.

1. Each component of a point or a vector is stored in a different numeric array. 
This eases control of a stack of points. Points are used to define planar 
polygons (faces of polyhedra) and the polygons are used to define convex 
polyhedra. 

2. When defining a polygon, a sequence of points must be provided such that 
advancing on the sequence is the same as rotating clockwise on the polygon, 
when the polygon is visible. This means that, when a polyhedron is to 
be drawn, the selection of polygons to be drawn is very easy: only those 
whose points rotate clockwise (the visible ones). Hidden line removal is thus 
achieved without sorting the polygons. 

3. Points can also be used to define other points according to rules that are 
common in the geometry of polyhedra or according to operations involving 
straight lines and/or planes and/or angles. 

4. The author plans to release an updated version with the ability to graph 
parametric lines and surfaces. 
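The visibility rule in item 2 can be sketched as an orientation test on the projected polygon. The sketch below is in Python rather than METAPOST and is not taken from the 3d package; the function names are hypothetical, and it assumes planar (u, v) coordinates with the v axis pointing up, so that a clockwise vertex order gives a negative signed area.

```python
def signed_area(points):
    """Shoelace formula for a list of (u, v) vertices.

    Positive for counterclockwise order, negative for clockwise order
    (assuming the v axis points up).
    """
    total = 0.0
    for (u0, v0), (u1, v1) in zip(points, points[1:] + points[:1]):
        total += u0 * v1 - u1 * v0
    return total / 2.0


def is_visible(projected_face):
    """A face defined clockwise-when-visible is drawn only if its
    projection still runs clockwise."""
    return signed_area(projected_face) < 0


front = [(0, 0), (0, 1), (1, 1), (1, 0)]   # clockwise: facing the viewer
back = list(reversed(front))               # same face seen from behind

print(is_visible(front), is_visible(back))
```

For a convex polyhedron this test alone decides which faces to draw, which is why no sorting of polygons is required.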

4 m3D 

Anthony Phan developed this very interesting package but has not yet written 
its documentation. Certainly, this is, of all four, the package that can produce the 
most complex and beautiful diagrams. It achieves this using, almost exclusively, 
four-sided polygons. 
Fig. 1. A diagram produced by m3D showing a single object, composed of spheres and
cylindrical connections, under a spherical perspective.

Fig. 2. A diagram produced by m3D showing a revolution surface under a central
perspective.

1. Complex objects can be defined and composed (see Figure 1). For example,
one of its many predefined objects is the fractal known as the “Menger
Sponge”.

2. It can render revolution surfaces defined from a standard METAPOST path
(see Figure 2).

3. Objects or groups of polygons can be sorted and drawn as if reflecting light
from a point source and/or disappearing in a foggy environment.



5 FEATPOST

Geared towards the production of physics diagrams, FEATPOST sacrifices
programming style and computational efficiency for a large feature set.

1. Besides the usual parallel and central perspectives it can make a sort of
“spherical distortion” as if a diagram is observed through a fish-eye lens
(this is also possible with m3D). This kind of perspective is advantageous
for animations as it allows the point of view to be inside or among the
diagram objects. When using the central perspective, points that are as
distant from the projection plane as the point of view get projected at
infinity, and METAPOST overflows and crashes. The spherical projection is
always finite.

2. It can mark and measure angles in space. 

3. It can produce shadows of some objects (see Figure 9). Shadows are
calculated in much the same way as perspectives. The perspective projection,
from 3D into 2D, is a calculation of the intersection of a straight line and
a plane. A shadow is also a projection from 3D into 2D, only the line and
the plane are different. The shadow must be projected onto the paper page
before the object that creates the shadow. Shadows are drawn after two
projections; objects are drawn after one projection and after their shadows.

4. It can correctly draw intersecting polygons (see Figure 12). 

5. It knows how to perform hidden line removal on some curved surface objects.
Imagine a solid cylinder. Now consider the part of the cylinder’s base that
is the farthest away. You only see a part of its edge. In order to draw that
part, it is necessary to know the two points at which the edge becomes
hidden. FEATPOST calculates this. Note that the edge is a circle, a curved
line. FEATPOST does not use polygons to hide lines on some curved surface
objects.

6. Supported objects include: dots, vectors, angles, ropes, circles, ellipses, cones, 
cylinders, globes, other curved surface objects, polygons, cuboids, polyhedra, 
functional and parametric surface plots, direction fields, field lines and tra- 
jectories in conservative force fields. 
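The line-plane intersection that item 3 identifies as the common core of perspectives and shadows can be sketched as follows. This is a Python illustration under stated assumptions, not FEATPOST code: the function names are hypothetical, points are plain (X, Y, Z) tuples, and the shadow is cast onto a horizontal plane at height zero.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def intersect_line_plane(p, q, plane_point, normal):
    """Return the point where the line through p and q meets the plane
    given by a point on it and its normal; None if the line is parallel."""
    direction = sub(q, p)
    denom = dot(normal, direction)
    if abs(denom) < 1e-12:
        return None
    t = dot(normal, sub(plane_point, p)) / denom
    return tuple(pi + t * di for pi, di in zip(p, direction))


def shadow_point(light, point, ground_height=0.0):
    """Shadow of `point` on the horizontal plane Z = ground_height:
    the line runs from the light source through the point."""
    return intersect_line_plane(light, point, (0.0, 0.0, ground_height), (0, 0, 1))


print(shadow_point((0, 0, 10), (1, 1, 5)))
```

A central perspective uses the same intersect_line_plane call with the point of view in place of the light source and the projection plane in place of the ground, which is why a shadow needs two projections in a row and an object only one.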

Many of the drawable objects are not made of polygons, but rather of
two-dimensional paths. FEATPOST does not attempt to draw surfaces of these
objects, only their edges. This is partly because of the use of intrinsic METAPOST
functions and partly because it eases the production of diagrams that combine
space and planar (on paper) objects.

One of the intrinsic METAPOST functions that became fundamental for
FEATPOST is the composition makepath makepen. As this converts a path into
its convex form, it very much simplifies the determination of some edges.
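The “convex form” mentioned above is the convex hull of the path’s points. The sketch below computes such a hull in Python with Andrew’s monotone chain method; it illustrates the result that makepath makepen delivers for a polygonal path, not how METAPOST computes it internally, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points, counterclockwise, starting at the
    lowest-leftmost point (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


outline = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
print(outline)
```

Applied to the projected vertices of a convex solid, the hull is exactly the visible silhouette, so the outer edges are found without examining individual faces.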

Another important aspect of the problem is hidden line removal. Hidden line
removal of a group of polygons can, in some cases, be performed by drawing
the polygons in decreasing order of distance to the point of view. FEATPOST
generally uses the Shell sorting method, although when the polygons are just the
faces of one cuboid FEATPOST has a small specific trick. There is also a specific
method for hidden line removal on cylinders and another for other curved surface
objects.
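Drawing polygons from the farthest to the nearest can be sketched as below. The Shell sort is the method named in the text; measuring the distance from the point of view to each polygon’s centroid is an assumption made for this Python illustration, as are all function names, since the paper does not say which reference point FEATPOST uses.

```python
import math


def shell_sort(items, key):
    """Return a new list sorted by ascending key using Shell's method
    with the simple gap sequence n/2, n/4, ..., 1."""
    items = list(items)
    gap = len(items) // 2
    while gap > 0:
        for i in range(gap, len(items)):
            current = items[i]
            j = i
            while j >= gap and key(items[j - gap]) > key(current):
                items[j] = items[j - gap]
                j -= gap
            items[j] = current
        gap //= 2
    return items


def centroid(polygon):
    n = len(polygon)
    return tuple(sum(vertex[k] for vertex in polygon) / n for k in range(3))


def painter_order(polygons, viewpoint):
    """Polygons ordered from the farthest to the nearest centroid."""
    return shell_sort(polygons, key=lambda poly: -math.dist(centroid(poly), viewpoint))


def triangle_at(z):
    return [(0, 0, z), (1, 0, z), (0, 1, z)]


order = painter_order([triangle_at(5), triangle_at(0), triangle_at(2)], (0, 0, 10))
print([poly[0][2] for poly in order])
```

Later polygons overwrite earlier ones on the page, so the nearest faces end up on top. The method fails for polygons that intersect or overlap cyclically, which is why the text says it works only “in some cases”.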

5.1 Examples 

Some of the FEATPOST macros are presented here. Detailed information is
available at

- http://matagalatlante.org/nobre/featpost/doc/macroMan.html

- CTAN:/graphics/metapost/macros/featpost/
Each perspective depends on the point of view. FEATPOST uses the global
variable f, of type color, to store the (X, Y, Z) coordinates of the point of view.
Also important is the aim of view (global variable viewcenter). Both together
define the line of view.

The perspective consists of a projection from space coordinates into planar
(u, v) coordinates on the projection plane. FEATPOST uses a projection plane
that is perpendicular to the line of view and contains the viewcenter.
Furthermore, one of the projection plane axes is horizontal and the other is on the
intersection of a vertical plane with the projection plane. “Horizontal” means
parallel to the XY plane.

One consequence of this setup is that f and viewcenter must not be on the
same vertical line (as long as the author avoids solving this problem, at least!).
The three kinds of projection known to FEATPOST are schematized in Figures 3,
4 and 5. The macro that actually does the projection is, in all cases, rp.
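The projection setup just described can be sketched in Python. This is an illustration under stated assumptions, not the code of rp: the function names are hypothetical, the signs chosen for the two plane axes are one possible convention, and only the parallel and central projections are covered.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def plane_axes(f, viewcenter):
    """Return (u, v, w): w along the line of view, u horizontal,
    v in the vertical plane that contains the line of view."""
    w = unit(sub(f, viewcenter))
    if abs(w[0]) < 1e-12 and abs(w[1]) < 1e-12:
        raise ValueError("f and viewcenter must not be on the same vertical line")
    u = unit(cross((0, 0, 1), w))   # parallel to the XY plane
    v = cross(w, u)
    return u, v, w


def parallel_projection(p, f, viewcenter):
    u, v, _ = plane_axes(f, viewcenter)
    d = sub(p, viewcenter)
    return dot(d, u), dot(d, v)


def central_projection(p, f, viewcenter):
    """Intersect the line from f through p with the projection plane."""
    u, v, w = plane_axes(f, viewcenter)
    denom = dot(sub(p, f), w)
    if abs(denom) < 1e-12:
        raise ValueError("point is projected at infinity")
    t = dot(sub(viewcenter, f), w) / denom
    q = tuple(fi + t * di for fi, di in zip(f, sub(p, f)))
    return parallel_projection(q, f, viewcenter)


print(central_projection((2.5, 1, 1), (5, 0, 0), (0, 0, 0)))
```

The two ValueError cases correspond to the two limitations stated in the text: the horizontal axis is undefined when f lies directly above or below viewcenter, and the central projection diverges for points as distant from the projection plane as the point of view.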





Fig. 3. Parallel projection. 



Fig. 4. Central projection.

Fig. 5. Spherical projection. The spherical projection is the composition of two
operations: (i) there is a projection onto a sphere and (ii) the sphere is placed
onto the projection plane.

Physics problems often require defining angles, and diagrams are needed to
visualize their meanings. The angline and squareangline macros (see Figure 6
and the code below) support this.

f := (5,3.5,1);
beginfig(2);
  cartaxes(1,1,1);
  color va, vb, vc, vd;
  va = (0.29,0.7,1.0);
  vb = (X(va),Y(va),0);
  vc = N((-Y(va),X(va),0));
  vd = (0,Y(vc),0);
  drawarrow rp(black)--rp(va);
  draw rp(black)--rp(vb)--
       rp(va) dashed evenly;
  draw rp(vc)--rp(vd) dashed evenly;
  drawarrow rp(black)--rp(vc);
  squareangline( va, vc, black, 0.15 );
  angline(va,red,black,0.75,
          decimal getangle(va,red),lft);
endfig;

Fig. 6. FEATPOST diagram using angline.

Visualizing parametric lines is another need of physicists. When two lines
cross, one should be able to see which line is in front of the other. The macro
emptyline can help here (see Figure 7 and the code below).

f := (2,4,1.8);
def theline( expr TheVal ) =
  begingroup
    numeric cred, cgre, cblu, param;
    param = TheVal*(6*360);
    cred = -0.3*cosd( param );
    cblu = 0.3*sind( param );
    cgre = param/850;
    ( (cred,cgre,cblu) )
  endgroup
enddef;
beginfig(1);
  numeric axsize, zaxpos, zaxlen;
  color xbeg, xend, ybeg,
        yend, zbeg, zend;
  axsize = 0.85;
  zaxpos = 0.55;
  zaxlen = 2.1;
  pickup pencircle scaled 1.5pt;
  xbeg = (axsize,0,0);
  xend = (-axsize,0,0);
  ybeg = (0,0,-axsize);
  yend = (0,0,axsize);
  zbeg = (zaxpos,-zaxpos,0);
  zend = (zaxpos,zaxlen,0);
  drawarrow rp( xbeg )--rp( xend );
  drawarrow rp( ybeg )--rp( yend );
  defaultscale := 1.95;
  label.rt( "A", rp( xend ) );
  label.lft( "B", rp( yend ) );
  emptyline(false,1,black,
            0.5black,1000,0.82,2,theline);
  drawarrow rp( zbeg )--rp( zend );
  label.bot( "C", rp( zend ) );
endfig;
Fig. 7. FEATPOST diagram using emptyline.

Cuboids and labels are always needed. The kindofcube and labelinspace
macros fulfill this need (see Figure 8 and the code below). The labelinspace
macro does not project labels from 3D into 2D. It only transforms the label in
the same way as its bounding box, that is, the same way as two perpendicular
sides of its bounding box. This is only exact for parallel perspectives.

f := (2,1,0.5);
ParallelProj := true;
verbatimtex
\documentclass{article}
\usepackage{beton,concmath,ccfonts}
\begin{document}
etex
beginfig(1);
  kindofcube(false,true,(0,-0.5,0),
             90,0,0,1.2,0.1,0.4);
  kindofcube(false,true,(0,0,0),
             0,0,0,0.5,0.1,0.8);
  labelinspace(false,(0.45,0.1,0.65),
               (-0.4,0,0),(0,0,0.1),
               btex
               \framebox{\textsc{Label}}
               etex);
endfig;
verbatimtex \end{document} etex

Fig. 8. FEATPOST diagram using the macros kindofcube and labelinspace.

Some curved surface solid objects can be drawn with FEATPOST. Among
them are cones (verygoodcone), cylinders (rigorousdisc) and globes
(tropicalglobe). These can also cast their shadows on a horizontal plane (see
Figure 9 and the code below). The production of shadows involves the global
variables LightSource, ShadowOn and HoriZon.

f := (13,6,4.5); ShadowOn := true;
LightSource := 10*(4,-3,6);
beginfig(3);
  numeric reflen, frac, coordg;
  numeric fws, NumLines;
  path ella, ellb;
  color axe, cubevertex, conecenter,
        conevertex, allellaxe, ellaaxe,
        pea, peb;
  frac := 0.5; wang := 60;
  axe := (0,cosd(90-wang),
          sind(90-wang));
  fws := 4; reflen := 0.35*fws;
  coordg := frac*fws;
  NumLines := 45;
  HoriZon := -0.5*fws;
  setthestage(0.5*NumLines,3.3*fws);
  cubevertex = (0.3*fws,-0.5*fws,0);
  tropicalglobe( 7, cubevertex,
                 0.5*fws, axe );
  allellaxe := reflen*(0.707,0.707,0);
  ellaaxe := reflen*( 0.5, -0.5, 0 );
  peb := ( -coordg, coordg, 0 );
  rigorousdisc( 0, true, peb,
                0.5*fws, -ellaaxe );
  conecenter =
    ( coordg, coordg, -0.5*fws );
  conevertex = conecenter +
    ( 0, 0, 0.9*fws );
  verygoodcone(false,conecenter,
               blue,reflen,conevertex);
endfig;
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Fig. 9. FERTP05T diagram using the macros rigorousdisc, verygoodcone, 
tropicalglobe and setthestage. 



Another very common need is the plotting of functions, usually satisfied by 
software such as Gnuplot (http : //www . gnuplot . info/) . Nevertheless, there are 
always new plots to draw. One kind of FERTP05T plot that just became possible 
is the “triangular grid triangular domain surface” (see Figure 10 and this code): 

f := 16*(4,1,1);
LightSource := 10*(4,-3,4);
def zsu( expr xc, yc ) =
  cosd(xc*57)*cosd(yc*57)+
  4*mexp(-(xc**2+yc**2)*6.4) enddef;
beginfig(1);
  hexagonaltrimesh(false,52,15,zsu);
endfig;
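
The surface described by the macro zsu can also be evaluated outside
METAPOST. The following Python sketch is not part of FEATPOST. It assumes the
standard METAPOST meaning of cosd (cosine of an angle in degrees) and of
mexp (mexp x = exp(x/256)), so that the factor 6.4 yields a Gaussian bump
exp(-(x^2+y^2)/40) and the factor 57 roughly converts radians to degrees.

```python
import math

def cosd(degrees):
    """Cosine of an angle given in degrees, like METAPOST's cosd."""
    return math.cos(math.radians(degrees))

def mexp(x):
    """METAPOST's mexp operator: exp(x/256)."""
    return math.exp(x / 256)

def zsu(xc, yc):
    """Height of the plotted surface at (xc, yc), transcribed from the macro."""
    return cosd(xc * 57) * cosd(yc * 57) + 4 * mexp(-(xc**2 + yc**2) * 6.4)

# Sample the surface along the x axis to see the bump at the origin.
for x in range(0, 13, 3):
    print(x, round(zsu(x, 0), 3))
```

The height is largest at the origin, where the cosine product and the bump
both reach their maximum.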

One feature that merges 2D and 3D involves what might be called “fat 
sticks”. A fat stick resembles the Teflon magnets used to mix chemicals. They 
have volume but can be drawn like a small straight line segment stroked with a 
pencircle. Fat sticks may be used to represent direction fields (unitary vector 
fields without arrows). See Figure 11 (the source code follows). 

f := 2*(5,3,2);
Spread := 70;
NF := 0;
beginfig(1);
  numeric hstep, hmax, hsdev;
  numeric basestep, basemax, basesdev;
  numeric i, j, k, angsdev, cylength;
  numeric cyradius, basen, hn;
  numeric vcx, vcy, vcz, hcurr;
  numeric ycurr, hbase, aone;
  numeric atwo, zcurr, counter;
  color lenvec, currpos;
  cylength = 0.45;
  cyradius = 0.1;
  basen = 11;
  hn = 3;
  basestep = cyradius*2.4;
  hstep = cylength*2.1;
  basesdev = cyradius*0.3;
  hsdev = hstep*0.04;
  hbase = -0.8;
  angsdev = 7;
  basemax = basen*basestep;
  hmax = hn*hstep;
  hcurr = hbase;
  counter = 0;
  for k=1 upto hn:
    hcurr := hcurr + hstep;
    for i=1 upto basen:
      for j=1 upto basen:
        zcurr := hcurr+hsdev*normaldeviate;
        xcurr := (i-1)*basestep
                 +uniformdeviate( basestep );
        ycurr := (j-1)*basestep
                 +uniformdeviate( basestep );
        aone := uniformdeviate( 360 );
        atwo := angsdev*normaldeviate;
        vcz := cosd( atwo );
        vcy := sind( atwo )*sind( aone );
        vcx := sind( atwo )*cosd( aone );
        currpos := ( xcurr, ycurr, zcurr );
        lenvec := cylength*(vcx,vcy,vcz);
        counter := incr( counter );
        generatedirline( counter, aone,
                         90-atwo, cylength, currpos );
      endfor;
    endfor;
  endfor;
  NL := counter;
  director_invisible( true, 5, false );
endfig;

Fig. 10. FEATPOST surface plot using the macro hexagonaltrimesh.




Fig. 11. The FEATPOST direction field macro director_invisible was used to
produce this representation of the molecular structure of a Smectic A liquid
crystal.



Finally, it is important to remember that some capabilities of FEATPOST,
although usable, may be considered “buggy” or only partially implemented.
These include the calculation of intersections among polygons, as in Figure 12,
and the drawing of toruses, as in Figure 13. These two figures show “usable”
situations but their code is skipped.

FEATPOST has many macros: some are specifically for physics diagrams,
others may be useful for general purposes, some do not fit in this article and,
sadly, some are not documented anywhere. For instance, the tools for producing
animations are not yet documented. (These tools are completely external:
the control of an animation is done with a Python script, and Ghostscript and
netpbm are used to produce MPEG videos.)

Fig. 12. Intersecting polygons drawn with the macro sharpraytrace.

Fig. 13. Final FEATPOST example containing a smoothtorus and a rigorousdisc
with a hole. These macros may fail for some view points.

In summary, the collection of three-dimensional METAPOST software, such as
the four reviewed packages, is large and growing in many independent directions.
It constitutes an excellent resource for those desiring to produce good diagrams.

Abstract. We describe the architecture of a syntax-directed editor for
authoring structured mathematical documents that can be used for the
generation of MathML markup [4]. The author interacts with the editor
by typing TeX markup as in a normal text editor, with the difference
that the typed markup is parsed and displayed on-the-fly. We discuss
issues regarding both the parsing and presentation phases and we propose
implementations for them. In contrast with existing similar tools, the
architecture we propose offers better compatibility with TeX syntax, a
pervasive use of standard technologies and a clearer separation of content
and presentation aspects of the information.



1 Introduction 

MathML [4] is an XML [2] application for the representation of mathematical 
expressions. Like most XML applications, MathML is unsuitable to be written 
directly because of its verbosity except in the simplest cases. Hence the editing 
of MathML documents needs the assistance of dedicated tools. As of today, such 
tools can be classified into two main categories: 

1. WYSIWYG (What You See Is What You Get) editors that allow the author 
to see the formatted document on the screen while it is being composed. The 
editor usually provides some “export mechanism” that creates XML with 
embedded MathML from the internal representation of the document; 

2. Conversion tools that generate MathML markup from different sources,
typically other markup languages for scientific documents, such as TeX [5].

Tools in the first category are appealing, but they suffer from at least two
limitations: a) editing is typically presentation oriented: the author is
primarily concerned about the “look” of the document and tends to forget about
its content; b) they may slow down the editing process because they often
involve the use of menus, palettes of symbols, and, in general, the pointing
device for completing most operations.

* This work has been supported by the European Project IST-2001-33562 MoWGLI.

In this paper we describe the architecture of a tool that tries to synthesize 
the “best of both worlds”. The basic idea is to create a WYSIWYG editor in
which editing is achieved by typing concrete markup, as the author would do in
an actual plain text editor. The markup is then tokenized and parsed on-the-fly, a
corresponding presentation is created by means of suitable transformations, and 
finally displayed. The editor is meant not only as an authoring tool, but more 
generally as an interface for math applications. 

Although in the paper we assume that the concrete markup typed by the
user is TeX (more precisely, the subset of TeX concerned with mathematics)
and that presentation markup is MathML, the system we are presenting is by
no means tied to these languages and can be targeted to other contexts as well.
One question that could arise is: “why TeX syntax?” We can see at least three
motivations: first of all, because of its popularity in many communities. Second,
because macros, which are a fundamental concept in TeX, are also the key to
editing at a more content-oriented level, which is a primary requirement for
many applications handling mathematics. Finally, because, as we will see, TeX
markup has good locality properties which make it suitable in the interactive
environment of our concern.

The body of the paper is structured into four main sections: in Section 2 
we overview the architecture of the tool while in Sections 3, 4, 5 we describe 
in more detail the main phases of the editing process (lexing, parsing, and 
transformation). Familiarity with TeX syntax and XML-related technologies
is assumed. 



2 Architecture 

Several tools for the conversion of TeX markup suffer from two major drawbacks
that we are not willing to tolerate in our design: (1) they rely on the TeX system
itself for parsing the markup. While guaranteeing perfect compatibility with
TeX, this implies the installation of the whole system. Moreover, the original
TeX parser does not meet the incremental requirements that we need; (2) the
lack of flexibility in the generation of the target document representation, which
is either fixed by the conversion tool or is only slightly customizable by the user.

To cope with problem (1) we need to write our own parser for TeX markup.
This is well known to be a non-trivial task, because of some fancy aspects
regarding the very nature of TeX syntax and the lack of a proper “TeX grammar”.
We will commit ourselves to a subset of TeX syntax which appears to be just
what an average author needs when writing a document. As we will see, the
loss in the range of syntactic expression is compensated by a cleaner and more
general transformation phase. As for the lack of a TeX grammar, we perceive
this as a feature rather than a weakness: after all TeX is built around the fact
that authors are free to define their own macros. Macros are the fundamental
entities giving structure to the document.

Let us now turn our attention to problem (2): recall that the general form of
a TeX macro definition (see The TeXbook [5]) is

\def⟨control sequence⟩⟨parameter text⟩{⟨replacement text⟩}

where the ⟨parameter text⟩ gives the syntax for invoking the macro and its
parameters whereas the ⟨replacement text⟩ defines somehow the “semantics” of
the macro (typically a presentational semantics). Thus the ultimate semantic
load of a macro is invariably associated with the configuration of the macro at
the point of definition.



Fig. 1. Tree representation for {1\over{x+1}^2} and corresponding MathML
markup: (a) “TeX tree”, (b) MathML tree.



We solve problem (2) by splitting up macro definitions so that structure and
semantics can be treated independently. A well-formed TeX document can be
represented as a tree whose leaves are either literals (strings of characters) or
macros with no parameters, and each internal node represents a macro and the
node’s children are the macro’s parameters. Entities like delimiters, square
brackets surrounding optional parameters or literals occurring in the
⟨parameter text⟩ of macro definitions are purely syntactic and need not be
represented in the tree if our main concern is capturing the structure of the
document. Fig. 1(a) shows the tree structure of a simple mathematical formula.

Once the document is represented as a tree, the process of macro expansion
(that is, interpretation) can be defined as a recursive transformation on the
nodes of the tree. As we will represent trees using XML, transformations can be
very naturally implemented by means of XSLT stylesheets [3]. Fig. 1(b) shows
the MathML tree corresponding to the TeX tree on the left hand side. The two
trees are basically isomorphic except for the name of the nodes and the presence
of explicit token nodes for literals in the MathML tree. This is to say that the
MathML tree can be generated from the TeX tree by simple transformations.
However, once the interpretation phase is independent of parsing (which does
not happen in TeX) it is natural to define much more general transformations
that are not just node-by-node rewritings.

The following are the main components of an interactive, syntax-based editor
for structured documents:

Input Buffer: the sequence of concrete characters typed by the author;

Lexical Analyzer: responsible for the tokenization of the characters in the
input buffer;

Dictionary: a map from ⟨control sequence⟩ to ⟨parameter text⟩ which is used
to know the syntax of macros;

Parser: for the creation of the internal tree structure representing the
document;

Transformation Engine: to map the internal tree into the desired format.

No doubt these entities are common to all tools converting TeX markup into
a different format, but the degree of mutual interdependence and the way they
are implemented may differ considerably, especially when interactivity is a main
concern. The added value of our approach is that it allows the author to
independently customize both the dictionary and the transformation engine, and
gives the advanced user of the editor the possibility of adapting the lexical
analyzer to languages other than TeX (we will spend a few more words on this
topic in the conclusions).

Notation. We will use the following conventions regarding lists. Lists are
uniformly typed, that is, elements of a list are all of the same type. We use
$\alpha^*$ to denote the type of a list whose elements have type $\alpha$.
$[]$ is the empty list; $n :: x$ is the list with head element $n$ and tail $x$;
$x @ y$ is the concatenation of two lists $x$ and $y$; $[n_1; \ldots; n_k]$ is a
short form for $n_1 :: \cdots :: n_k :: []$.



3 Lexical Analysis 

The purpose of this phase is to tokenize the input buffer. As we are talking 
about an interactive tool, the presence of an input buffer may look surprising. 
Implementations for the input buffer range from virtual buffers (there is no 
buffer at all, characters are collected by the lexical analyzer which outputs tokens 
as they are completed) to flat buffers (just a string of characters as in a text 
editor) to structured buffers. For efficiency, we do not investigate in detail all 
the possibilities in this paper, but early experiments have shown that working 
with virtual buffers can be extremely difficult. As long as insert operations are 
performed at the right end of the buffer the restructuring operations on the 
parse tree are fairly easy, but when it comes to deletion or to modifications in 
arbitrary positions, the complexity of restructuring operations rises rapidly to 
an unmanageable level. Hence, from now on we will assume that a flat input
buffer is available. Whether the buffer should be visible or not is a subjective 
matter, and may also depend on the kind of visual feedback given by the editor 
on incomplete and/or incorrect typed markup. 

The outcome of the lexer is a stream (list) of tokens. Each token may have
one of three forms: a literal, that is a single character to be treated “as is”;
a space, that is a sequence of one or more space-like characters; or a control
sequence, that is the name of a macro.

Interactive Editing of MathML Markup Using TeX Syntax

Since the token stream is the only interface between the lexer and the parser, 
the lexer has the freedom to perform arbitrary mappings from the characters in 
the input buffer to tokens in the stream. In particular, some TeX commands like
\alpha or \rightarrow are just placeholders for Unicode characters. There is no 
point in communicating these entities as control sequences as the internal tree 
representation (XML) is able to accommodate Unicode characters naturally; 
also, treating them as literals simplifies the subsequent transformation phase. 

On the other hand, there are characters, such as curly braces { and } or
scripting operators _ and ^, that have a special meaning. Logically these are just
short names for macros that obey their own rules regarding parameters. What
we propose is a general classification of parameter types which, in addition to
parameters in normal TeX definitions, allows us

- to deal with optional parameters as LaTeX [6] does;

- to treat { as just an abbreviation for \bgroup and make \bgroup a macro
with one parameter delimited by \egroup, which we treat as the expansion
for }. In order for this “trick” to work we have to design the parser carefully,
as we will see in Sect. 4;

- to treat scripting operators _ and ^ as the two macros \sb and \sp both
accepting a so-called pre-parameter (a parameter that occurs before the
macro in the input buffer) and a so-called post-parameter (a parameter that
occurs after the macro in the input buffer);

- to deal with macros that have “open” parameters. For instance \rm affects
the markup following it until the first delimiter coming from an outermost
macro is met. We treat \rm as a macro with an open post-parameter that
extends as far as possible to the right. Similarly, \over can be seen as a
macro with open pre- and post-parameters.
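
The mappings performed by the lexer can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions, not the
implementation of the editor: the names lex, UNICODE and SHORTHAND are
hypothetical, the Unicode table lists just the two commands mentioned above,
control tokens are not yet annotated with their signature, and the TeX rule
that skips spaces after a control word is ignored.

```python
# Control sequences that are mere placeholders for Unicode characters.
UNICODE = {"alpha": "\u03b1", "rightarrow": "\u2192"}
# Special characters treated as short names for macros.
SHORTHAND = {"{": "bgroup", "}": "egroup", "_": "sb", "^": "sp"}

def lex(buf):
    """Turn the characters of a flat input buffer into a list of tokens."""
    tokens, i = [], 0
    while i < len(buf):
        c = buf[i]
        if c.isspace():
            # A run of space-like characters yields a single space token.
            while i < len(buf) and buf[i].isspace():
                i += 1
            tokens.append(("space",))
        elif c == "\\":
            j = i + 1
            while j < len(buf) and buf[j].isalpha():
                j += 1
            if j == i + 1 and j < len(buf):
                j += 1  # one-character control symbol
            name = buf[i + 1:j]
            if name in UNICODE:
                tokens.append(("literal", UNICODE[name]))
            else:
                tokens.append(("control", name))
            i = j
        elif c in SHORTHAND:
            tokens.append(("control", SHORTHAND[c]))
            i += 1
        else:
            tokens.append(("literal", c))
            i += 1
    return tokens

print(lex(r"{1\over x_\alpha}"))
```

Note how \alpha reaches the parser as a literal, while { and _ reach it as
the control sequences bgroup and sb.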

In order to describe parameter types we need to define the concept of term. 
A term is either a literal or a macro along with all its parameters (equivalently, 
a term is a subtree in the parsing tree) . A simple parameter consists of one sole 
term. A compound parameter consists of one or more terms extending as far 
as possible to the left or to the right of the macro depending on whether the 
parameter is “pre-” or “post-” . A delimited parameter consists of one or more 
terms extending as far as possible to the right up to but not including a given 
token t. An optional parameter is either empty or it consists of one or more terms 
enclosed within a pair of square brackets [ and ]. The absence of the opening
bracket means that the optional parameter is not given. A token parameter is a 
given token t representing pure syntactic sugar. It does not properly qualify as 
a parameter and does not appear in the parsing tree. 

Formally tokens and parameter types are defined as follows:

$\mathit{token} ::= \mathrm{literal}(v) \mid \mathrm{space} \mid \mathrm{control}_{(p_1,p_2)}(v)$
$\mathit{type} ::= \mathrm{simple} \mid \mathrm{compound} \mid \mathrm{delimited}(t) \mid \mathrm{optional} \mid \mathrm{token}(t)$

where $t \in \mathit{token}$, $v \in \mathit{string}$ is an arbitrary string of
Unicode characters, $p_1 \in \{\mathrm{simple}, \mathrm{compound}\}^*$ and
$p_2 \in \mathit{type}^*$ are lists of parameter types for the pre- and
post-parameters respectively. Note that pre-parameters can be of type simple or
compound only.

Luca Padovani

Table 1. Examples of TeX and LaTeX macros along with their signature.

Macro            Pre-parameters   Post-parameters
overline                          [simple]
sqrt                              [simple] (TeX)
                                  [optional; simple] (LaTeX)
root                              [delimited(control(of)); simple]
over, choose     [compound]       [compound]
frac                              [simple; simple]
rm, bf, tt, it                    [compound]
left                              [simple; delimited(control(right)); simple]
sb, sp           [simple]         [simple]
bgroup                            [delimited(control(egroup))]
begin                             [simple; optional; delimited(control(end)); simple]
proclaim                          [token(space); delimited(literal(.)); token(space);
                                  delimited(control(par))]

The dictionary is a total map

$\mathit{dictionary} : \mathit{string} \to \mathit{token}$

such that for each unknown control sequence $v$,
$\mathit{dictionary}(v) = \mathrm{control}_{([],[])}(v)$.
Table 1 shows part of a possible dictionary for some TeX and LaTeX commands
(mostly for mathematics). Note how it is possible to encode the signature for the
\begin control sequence, although it is not possible to enforce the constraint that
the first and the last parameters must have equal value in order for the construct
to be balanced.
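
A total dictionary of this kind might be sketched as follows. The encoding is
an assumption made for illustration: a control token is a Python tuple
("control", name, pre, post), a delimiter is identified by its kind and name,
and the table SIGNATURES copies only a few rows of Table 1.

```python
# Signatures taken from Table 1: (pre-parameters, post-parameters).
SIGNATURES = {
    "overline": ((), ("simple",)),
    "frac": ((), ("simple", "simple")),
    "over": (("compound",), ("compound",)),
    "sb": (("simple",), ("simple",)),
    "sp": (("simple",), ("simple",)),
    "root": ((), (("delimited", ("control", "of")), "simple")),
    "bgroup": ((), (("delimited", ("control", "egroup")),)),
}

def dictionary(name):
    """Total map: unknown control sequences get an empty signature."""
    pre, post = SIGNATURES.get(name, ((), ()))
    return ("control", name, pre, post)

print(dictionary("over"))
print(dictionary("mymacro"))
```

Because the lookup never fails, the lexer can annotate every control sequence,
including those the author has not declared yet.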

4 Parsing 

We now come to the problem of building the parse tree starting from the stream
of tokens produced by the lexical analyzer. As we have already pointed out
there is no fixed grammar that we can use to generate the parser automatically:
authors are free to introduce new macros and hence new ways of structuring the
parse tree. Thus we will build the parser “by hand”. More reasons for writing an
ad-hoc parser, namely error recovery and incrementality, will be discussed later
in this section.

The following grammar captures formally the structure of a TeX parsing tree,
which is the outcome of the parser:

$\mathit{node} ::= \mathrm{empty}$
$\qquad \mid \mathrm{literal}(v) \qquad v \in \mathit{string}$
$\qquad \mid \mathrm{macro}(v, x) \qquad v \in \mathit{string},\ x \in \mathit{param}^*$
$\mathit{param} ::= \{a\} \qquad a \in \mathit{node}^*$

Note that a parameter is made of a list of nodes and that literals are strings
instead of single characters. The empty node is used to denote a missing term
when one was expected; its role will be clarified later in this section.

The appendix contains the Document Type Definition for the XML
representation of parsing trees. It is simpler than the TeXML DTD [7] and we
are providing it as mere reference.
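
As an illustration of the grammar above, the following Python sketch encodes
nodes as tuples and serializes them to XML. The element names macro (with a
name attribute) and p follow the templates of Fig. 2; the element names
literal and empty are assumptions made here, since the DTD is not reproduced
in this section.

```python
import xml.etree.ElementTree as ET

def to_xml(node):
    """Serialize a node: ("empty",), ("literal", v) or ("macro", v, params)."""
    kind = node[0]
    if kind == "empty":
        return ET.Element("empty")        # assumed element name
    if kind == "literal":
        element = ET.Element("literal")   # assumed element name
        element.text = node[1]
        return element
    element = ET.Element("macro", {"name": node[1]})
    for param in node[2]:
        # Each parameter is a list of nodes wrapped in a p element.
        p = ET.SubElement(element, "p")
        for child in param:
            p.append(to_xml(child))
    return element

fraction = ("macro", "over", [[("literal", "1")], [("literal", "x")]])
xml = ET.tostring(to_xml(fraction), encoding="unicode")
print(xml)
```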



4.1 Parsing Functions 

Table 2 gives the operational semantics of the parser. In this table only, for each 
$a \in \mathit{node}^*$ we define $a! = [\mathrm{empty}]$ if $a = []$ and $a! = a$ otherwise. There are four
parsing functions: T for terms, A for pre-parameters, B for post-parameters, 
and C for delimited sequences of terms. Each parsing function is defined by 
induction on the structure of its arguments. Axioms (rules with no horizontal 
line) denote base cases, while inference rules define the value of a parsing function 
(the conclusion, below the line) in terms of the value of one or more recursive 
calls to other functions (the premises, above the line). Right arrows denote the 
action of parsing. Arrows are decorated with a label that identifies the parser 
along with its parameters, if any. The T, B, and C parsers have a parameter 
representing the list of delimiters in the order they are expected, with the head 
of the list being the first expected delimiter. The C parser also has a Boolean 
parameter indicating whether the parser should or should not “eat” the delimiter 
when it is eventually met. 

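The root parsing function is T. Given a delimiter $t \in \mathit{token}$ and a
token stream $l \in \mathit{token}^*$ we have

$[], l \xrightarrow{T([t])} [n], l'$

where $n \in \mathit{node}$ is the parsed term and $l' \in \mathit{token}^*$ is
the part of the token stream that has not been consumed. Spaces are ignored when
parsing terms and pre-parameters (rule T.4), but not when parsing post-parameters
(rule B.4). The A function differs from the other parsing functions because by
the time a macro with pre-parameters is encountered, pre-parameters have already
been parsed. The lists $a \in \mathit{node}^*$ in the T, A, and C parsers
represent the terms accumulated before the term being parsed. Note that
pre-parameters are inserted at the end of the parameter list (rules A.2 to A.4)
and that post-parameters are inserted at the beginning of the parameter list
(rules B.5 to B.10). This way parameter nodes appear in the parse tree in the
same order as in the original token stream (rule T.5).

Table 2. Parsing functions for the simplified TeX markup.

$\forall d \in \mathit{token}^*$: $\xrightarrow{T(d)} : \mathit{node}^* \times \mathit{token}^* \to \mathit{node}^* \times \mathit{token}^*$
$\xrightarrow{A} : \mathit{node}^* \times \mathit{type}^* \to \mathit{node}^* \times \mathit{param}^*$
$\forall d \in \mathit{token}^*$: $\xrightarrow{B(d)} : \mathit{type}^* \times \mathit{token}^* \to \mathit{param}^* \times \mathit{token}^*$
$\forall d \in \mathit{token}^*, \forall b \in \mathit{bool}$: $\xrightarrow{C(d,b)} : \mathit{node}^* \times \mathit{token}^* \to \mathit{node}^* \times \mathit{token}^*$

(T.1) $a, [] \xrightarrow{T(d)} a, []$
(T.2) $a, t :: l \xrightarrow{T(d)} a, t :: l$ ($t$ occurs in $d$)
(T.3) $a, \mathrm{literal}(v) :: l \xrightarrow{T(d)} a @ [\mathrm{literal}(v)], l$
(T.4) $\dfrac{a, l \xrightarrow{T(d)} a', l'}{a, \mathrm{space} :: l \xrightarrow{T(d)} a', l'}$
(T.5) $\dfrac{a, p_1 \xrightarrow{A} a', x \qquad p_2, l \xrightarrow{B(d)} y, l'}{a, \mathrm{control}_{(p_1,p_2)}(v) :: l \xrightarrow{T(d)} a' @ [\mathrm{macro}(v, x @ y)], l'}$

(A.1) $a, [] \xrightarrow{A} a, []$
(A.2) $\dfrac{[], p \xrightarrow{A} a, x}{[], s :: p \xrightarrow{A} a, x @ [\{[\mathrm{empty}]\}]}$
(A.3) $\dfrac{a, p \xrightarrow{A} a', x}{a @ [n], \mathrm{simple} :: p \xrightarrow{A} a', x @ [\{[n]\}]}$
(A.4) $\dfrac{[], p \xrightarrow{A} a', x}{a, \mathrm{compound} :: p \xrightarrow{A} a', x @ [\{a\}]}$

(B.1) $[], l \xrightarrow{B(d)} [], l$
(B.2)† $\dfrac{p, t' :: l \xrightarrow{B(d)} x, l'}{\mathrm{token}(t) :: p, t' :: l \xrightarrow{B(d)} x, l'}$ ($t' \neq t$)
(B.3)† $\dfrac{p, [] \xrightarrow{B(d)} x, l}{\mathrm{token}(t) :: p, [] \xrightarrow{B(d)} x, l}$
(B.4) $\dfrac{p, l \xrightarrow{B(d)} x, l'}{\mathrm{token}(t) :: p, t :: l \xrightarrow{B(d)} x, l'}$
(B.5) $\dfrac{[], l \xrightarrow{T(d)} a, l' \qquad p, l' \xrightarrow{B(d)} x, l''}{\mathrm{simple} :: p, l \xrightarrow{B(d)} \{a!\} :: x, l''}$
(B.6) $\dfrac{[], l \xrightarrow{C(d,\mathrm{false})} a, l' \qquad p, l' \xrightarrow{B(d)} x, l''}{\mathrm{compound} :: p, l \xrightarrow{B(d)} \{a!\} :: x, l''}$
(B.7) $\dfrac{p, [] \xrightarrow{B(d)} x, l}{\mathrm{optional} :: p, [] \xrightarrow{B(d)} \{[]\} :: x, l}$
(B.8) $\dfrac{[], l \xrightarrow{C(\mathrm{literal}(]) :: d,\mathrm{true})} a, l' \qquad p, l' \xrightarrow{B(d)} x, l''}{\mathrm{optional} :: p, \mathrm{literal}([) :: l \xrightarrow{B(d)} \{a\} :: x, l''}$
(B.9) $\dfrac{p, t :: l \xrightarrow{B(d)} x, l'}{\mathrm{optional} :: p, t :: l \xrightarrow{B(d)} \{[]\} :: x, l'}$ ($t \neq \mathrm{literal}([)$)
(B.10) $\dfrac{[], l \xrightarrow{C(t :: d,\mathrm{true})} a, l' \qquad p, l' \xrightarrow{B(d)} x, l''}{\mathrm{delimited}(t) :: p, l \xrightarrow{B(d)} \{a!\} :: x, l''}$

(C.1) $a, [] \xrightarrow{C(d,b)} a, []$
(C.2) $a, t :: l \xrightarrow{C(t :: d,\mathrm{true})} a, l$
(C.3) $a, t :: l \xrightarrow{C(d,b)} a, t :: l$ ($t$ occurs in $d$)
(C.4) $\dfrac{a, t :: l \xrightarrow{T(d)} a', l' \qquad a', l' \xrightarrow{C(d,b)} a'', l''}{a, t :: l \xrightarrow{C(d,b)} a'', l''}$ ($t$ does not occur in $d$)

Example. Given that the input buffer contains the source shown in Fig. 1,
the lexical analyzer would produce the following stream of tokens:

$l_0 = [\mathrm{control}_{([],[\mathrm{delimited}(\mathrm{control}(\mathrm{egroup}))])}(\mathrm{bgroup});$
$\quad \mathrm{literal}(1); \mathrm{control}_{([\mathrm{compound}],[\mathrm{compound}])}(\mathrm{over});$
$\quad \mathrm{control}_{([],[\mathrm{delimited}(\mathrm{control}(\mathrm{egroup}))])}(\mathrm{bgroup});$
$\quad \mathrm{literal}(x); \mathrm{literal}(+); \mathrm{literal}(1);$
$\quad \mathrm{control}(\mathrm{egroup}); \mathrm{control}_{([\mathrm{simple}],[\mathrm{simple}])}(\mathrm{sp});$
$\quad \mathrm{literal}(2); \mathrm{control}(\mathrm{egroup})]$

By the application of the parsing rules given in Table 2 it can be shown that

$[], l_0 @ [\mathrm{control}(\mathrm{eoi})] \xrightarrow{T([\mathrm{control}(\mathrm{eoi})])} [n], [\mathrm{control}(\mathrm{eoi})]$

where $n \in \mathit{node}$ is the same tree shown in Fig. 1 except that the g
nodes are labeled with bgroup.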
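
The parsing functions can be transcribed almost literally into a functional
style. The Python sketch below is one possible reading of the rules, not the
code of the editor. It assumes that tokens and nodes are tuples, that a
delimiter is identified by its kind and name only (the helper key), and that a
pattern mismatch is silently skipped instead of producing a warning.

```python
EMPTY = ("empty",)

def key(tok):
    """Identify a token by kind and name, ignoring its signature."""
    return tok[:2]

def bang(a):
    """The a! operator: a missing term becomes an empty node."""
    return a if a else [EMPTY]

def parse_T(d, a, l):
    """Parse one term; d lists the keys of the pending delimiters."""
    if not l or key(l[0]) in d:                      # T.1, T.2
        return a, l
    t, rest = l[0], l[1:]
    if t[0] == "literal":                            # T.3
        return a + [t], rest
    if t[0] == "space":                              # T.4
        return parse_T(d, a, rest)
    _, name, pre, post = t                           # T.5
    a2, x = parse_A(a, list(pre))
    y, l2 = parse_B(d, list(post), rest)
    return a2 + [("macro", name, x + y)], l2

def parse_A(a, p):
    """Assign already parsed terms to the pre-parameters p."""
    if not p:                                        # A.1
        return a, []
    if not a:                                        # A.2
        a2, x = parse_A([], p[1:])
        return a2, x + [[EMPTY]]
    if p[0] == "simple":                             # A.3
        a2, x = parse_A(a[:-1], p[1:])
        return a2, x + [[a[-1]]]
    a2, x = parse_A([], p[1:])                       # A.4
    return a2, x + [a]

def parse_B(d, p, l):
    """Parse the post-parameters p from the token stream l."""
    if not p:                                        # B.1
        return [], l
    s, rest = p[0], p[1:]
    if isinstance(s, tuple) and s[0] == "token":     # B.2 to B.4
        skip = 1 if l and key(l[0]) == s[1] else 0
        return parse_B(d, rest, l[skip:])
    if s == "optional":                              # B.7 to B.9
        a, l1 = [], l
        if l and key(l[0]) == ("literal", "["):
            a, l1 = parse_C([("literal", "]")] + d, True, [], l[1:])
        x, l2 = parse_B(d, rest, l1)
        return [a] + x, l2
    if s == "simple":                                # B.5
        a, l1 = parse_T(d, [], l)
    elif s == "compound":                            # B.6
        a, l1 = parse_C(d, False, [], l)
    else:                                            # B.10
        a, l1 = parse_C([s[1]] + d, True, [], l)
    x, l2 = parse_B(d, rest, l1)
    return [bang(a)] + x, l2

def parse_C(d, b, a, l):
    """Parse terms up to a delimiter, eating the first of d when b is true."""
    while l:                                         # C.1 ends the loop
        if b and d and key(l[0]) == d[0]:            # C.2
            return a, l[1:]
        if key(l[0]) in d:                           # C.3
            return a, l
        a, l = parse_T(d, a, l)                      # C.4
    return a, l

def lit(c):
    return ("literal", c)

BG = ("control", "bgroup", (), (("delimited", ("control", "egroup")),))
EG = ("control", "egroup", (), ())
OVER = ("control", "over", ("compound",), ("compound",))
SP = ("control", "sp", ("simple",), ("simple",))
EOI = ("control", "eoi", (), ())

l0 = [BG, lit("1"), OVER, BG, lit("x"), lit("+"), lit("1"), EG,
      SP, lit("2"), EG]
tree, rest = parse_T([key(EOI)], [], l0 + [EOI])
print(tree)
```

Applied to the stream $l_0$ of the example, the sketch yields a bgroup node
containing an over node whose second parameter is the sp node, as in Fig. 1.
Applied to a lone \over it pads both parameters with empty nodes.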

4.2 Error Recovery 

Parsing functions are all total functions: they always produce a result, even when
the input token stream is malformed. Unlike the parsers of batch TeX converters
or the TeX parser itself, our parser must cope with the many moments during the
editing process when the input buffer contains incorrect or incomplete markup,
for example because not all the required parameters of a macro have been entered
yet. The parser must recover from such situations in a tolerant and hopefully
sensible way. We distinguish three kinds of situations: missing parameters,
pattern mismatch, and ambiguity, which we examine in the rest of this section.

Missing Parameters. Consider an input token stream representing the sole
\over macro with no arguments provided:

$l_1 = [\mathrm{control}_{([\mathrm{compound}],[\mathrm{compound}])}(\mathrm{over}); \mathrm{control}(\mathrm{eoi})]$

It is easy to check that

$[], l_1 \xrightarrow{T([\mathrm{control}(\mathrm{eoi})])} [\mathrm{macro}(\mathrm{over}, [\{[\mathrm{empty}]\}; \{[\mathrm{empty}]\}])], [\mathrm{control}(\mathrm{eoi})]$

More generally the parser inserts empty nodes in the parsing tree wherever
an expected parameter is not found in the token stream. This behavior can be
seen in rule A.2 and also in rules B.5, B.6, and B.10 where the ! operator is
used. For optional parameters an empty node list is admitted (rules B.7 and
B.8).

The presence of empty nodes guarantees that the generated tree is
structurally well-formed, which is crucial for the subsequent transformation
phase. It also allows the application to give the user feedback indicating the
absence of required parameters. In the example above, for instance, the
application may display something like a fraction with empty placeholders,
suggesting that a fraction was entered, but neither the numerator nor the
denominator has been.

Pattern Mismatch. Rules B.2 and B.3 have been marked with a † to indicate
that the parser expects a token which is not found in the token stream. In both
cases the parser will typically notify the user with a warning message.

Ambiguities. In TeX one cannot pass a macro with parameters as the parameter
of another macro, unless the parameter is enclosed within a group. For example,
it is an error to write \sqrt\sqrt{x}, the correct form is \sqrt{\sqrt{x}}.
Because we treat the left curly brace like any other macro, grouping would not
help our parser in resolving ambiguities. However, the parser knows how many
parameters a macro needs, because the token representing the control sequence
has been annotated with such information by the lexer. When processing a macro
with arguments the parser behaves “recursively”: it does not let an incomplete
macro be “captured” if it was passed as parameter of an outer macro. A
consequence of this extension is that any well-formed fragment of TeX markup
is accepted by our parser resulting in the same structure, but there are some
strings accepted by our parser that cause the TeX parser to fail.

4.3 Incremental Parsing 

Parsing must be efficient because it is performed in real-time, in principle at
every modification of the input buffer, no matter how simple the modification
is. Fortunately TeX markup exhibits good locality, that is, small modifications
in the document cause small modifications in the parsing tree. Consequently
we can avoid re-parsing the whole source document: we just need to re-parse
a small interval of the input buffer around the point where the modification
has occurred, and adjust the parsing tree accordingly. Let us consider again the
example of Fig. 1 and suppose that a change is made in the markup

{1\over{1+x}^2} → {1\over{1+x+y}^2}

(a +y is added to the denominator of the fraction). To be conservative we can
re-parse the smallest term within braces that includes the modified part (here
{1+x}, which becomes {1+x+y}). Once the term has been re-parsed it has to be
substituted in place of the old term in the parsing tree.

In order to compute the interval of the input buffer to be re-parsed we
annotate the nodes of the parsing tree with information about the first and
the last characters of the buffer which were scanned while building the node
and all of its children. A simple visit of the tree can locate the smallest
interval affected by the modification.
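
The visit can be sketched as follows. The class Node and the function
smallest_braced are hypothetical names, and the intervals in the example were
computed by hand for the buffer {1\over{1+x}^2}, counting characters from 0.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    first: int   # index of the first buffer character scanned for the node
    last: int    # index of the last buffer character scanned (inclusive)
    children: list = field(default_factory=list)

def smallest_braced(node, lo, hi):
    """Return the deepest bgroup node whose interval covers [lo, hi]."""
    if not (node.first <= lo and hi <= node.last):
        return None
    for child in node.children:
        hit = smallest_braced(child, lo, hi)
        if hit is not None:
            return hit
    # Only braced terms are candidates for re-parsing.
    return node if node.name == "bgroup" else None

inner = Node("bgroup", 7, 11, [Node("literal", 8, 8),
                               Node("literal", 9, 9),
                               Node("literal", 10, 10)])
sp = Node("sp", 7, 13, [inner, Node("literal", 13, 13)])
over = Node("over", 1, 13, [Node("literal", 1, 1), sp])
root = Node("bgroup", 0, 14, [over])

# Typing "+y" just before the closing brace at index 11 affects {1+x} only.
print(smallest_braced(root, 11, 11).first, smallest_braced(root, 11, 11).last)
```

A modification of the exponent (index 13) is not enclosed by the inner braces,
so the visit falls back to the outermost group.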

Curly braces occur frequently enough in the markup to give good granularity 
for re-parsing. At the same time limiting re-parsing to braced terms helps control 
the costs related to the visit to the parsing tree and to the implementation of 
the incremental parsing and transformation machinery. 

5 Transformation 

The transformation phase recognizes structured patterns in the parsing tree
and generates corresponding fragments of the result document. We have already
anticipated that XSLT is a very natural choice for the implementation of this
phase. Besides, XSLT stylesheets can be extended very easily, by providing new
templates that recognize and properly handle new macros that an author has
introduced.

<xsl:template
    match="macro[@name='over']">
  <m:mfrac>
    <xsl:if test="@id">
      <xsl:attribute name="xref">
        <xsl:value-of select="@id"/>
      </xsl:attribute>
    </xsl:if>
    <xsl:apply-templates select="p[1]"/>
    <xsl:apply-templates select="p[2]"/>
  </m:mfrac>
</xsl:template>

(a)

<xsl:template
    match="macro[@name='sb']
                [p[1]/*[1][self::macro[@name='sp']]]">
  <m:msubsup>
    <xsl:if test="@id">
      <xsl:attribute name="xref">
        <xsl:value-of select="@id"/>
      </xsl:attribute>
    </xsl:if>
    <xsl:apply-templates select="p[1]/*/p[1]"/>
    <xsl:apply-templates select="p[2]"/>
    <xsl:apply-templates select="p[1]/*/p[2]"/>
  </m:msubsup>
</xsl:template>

(b)

Fig. 2. Example of XSLT templates for the transformation of the internal parsing tree
into a MathML tree. MathML elements can be distinguished because of the m: prefix.

We can see in Fig. 2 two sample templates taken from an XSLT stylesheet
for converting the internal parsing tree into a MathML tree. Both templates
have a preamble made of an xsl:if construct which we will discuss later in this
section. Since the TeX tree and the MathML tree are almost isomorphic (Fig. 1)
the transformation is generally very simple and in many cases it amounts to just
renaming the node labels. Template (a) is one such case: it matches any node
in the parsing tree with label macro and having the name attribute set to over.
The node for the \over macro corresponds naturally to the mfrac element in
MathML. The two parameters of \over are transformed recursively by applying
the stylesheet templates to the first and second child nodes (p[1] means “the
first p child of this node”, similarly p[2] refers to the second p child).

Template (b) is slightly more complicated and shows one case where there
is some change in the structure. For combined sub/superscripts TeX accepts
a sequence of _ and ^ no matter in what order they occur, but MathML
has a specific element for such expressions, namely msubsup. The template
matches an sb node whose first parameter contains an sp node, thus detecting
a fragment of markup in which the superscript precedes the subscript; then
the corresponding msubsup element is created and its three children are
accessed in the proper position of the parsing tree. A symmetric template will
handle the case where the subscript occurs before the superscript.
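
The two templates can be mimicked outside XSLT to make the recursion explicit.
The Python sketch below is only an analogy under stated assumptions: nodes are
tuples ("literal", v), ("empty",) or ("macro", name, params), every parameter
is wrapped in an mrow, the choice of mn, mi or mo for a literal is a guess made
here, and the xref attributes are left out.

```python
from xml.sax.saxutils import escape

def row(nodes):
    """Render one parameter (a list of nodes) as a single MathML child."""
    return "<mrow>" + "".join(transform(n) for n in nodes) + "</mrow>"

def transform(node):
    """Recursively rewrite a parse-tree node into MathML markup."""
    kind = node[0]
    if kind == "empty":
        return "<mrow/>"
    if kind == "literal":
        text = node[1]
        # Assumed classification of literals into MathML token elements.
        tag = "mn" if text.isdigit() else "mi" if text.isalpha() else "mo"
        return f"<{tag}>{escape(text)}</{tag}>"
    _, name, params = node
    if name == "over":                                   # template (a)
        return f"<mfrac>{row(params[0])}{row(params[1])}</mfrac>"
    first = params[0] if params else []
    if (name == "sb" and first and first[0][0] == "macro"
            and first[0][1] == "sp"):                    # template (b)
        base, sup = first[0][2]
        return f"<msubsup>{row(base)}{row(params[1])}{row(sup)}</msubsup>"
    # Macros without a dedicated rule simply keep their content.
    return "<mrow>" + "".join(row(p) for p in params) + "</mrow>"

x_sup_sub = ("macro", "sb",
             [[("macro", "sp", [[("literal", "x")], [("literal", "2")]])],
              [("literal", "i")]])
print(transform(x_sup_sub))
```

As in template (b), the base and the superscript are fetched from inside the
sp node, and the subscript is emitted between them.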

5.1 Incremental Transformation 

As we have done for parsing, for transformations we also need to account for 
their cost. In a batch, one-shot conversion from TeX this is not generally an 
issue, but in an interactive authoring tool a transformation is required at every 
modification of the parsing tree in order to update the view of the document. 
Intuitively, we can reason that if only a fragment of the parsing tree has 
changed, we need to re-transform only that fragment and substitute the result in 
the final document. This technique makes two assumptions: (1) that transformations 
are context-free; that is, the transformation of a fragment in the parsing tree is 
not affected by the context in which the fragment occurs; (2) that we are able 
to relate corresponding fragments between the parsing and the result trees. 

Template (b) in Fig. 2 shows one case where the transformation is not 
context-free: the deeper sp node is not processed as if it occurred alone, 
but it is “merged” together with its parent. More generally we can imagine that 
transformations can make almost arbitrary re-arrangements of the structure. 
This problem cannot be solved unless we make some assumptions, and the one we 
have already committed to in Sect. 4 is that braces define “black-box” fragments 
which can be transformed in isolation, without context dependencies. 

As for the matter of relating corresponding fragments of the two documents, 
we use identifiers and references. Each node in the parsing tree is annotated with 
a unique identifier (in our sample templates we are assuming that the identifier 
is a string in the id attribute). Templates create corresponding xref attributes 
in the result document “pointing” to the fragment with the same identifier in the 
parsing tree. This way, whenever a fragment of the parsing tree is re-transformed, 
it replaces the fragment in the result document with the same identifier. 
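
The replacement step can be sketched as follows. This is a hedged illustration in Python, not the tool's implementation: the function name replace_fragment is hypothetical, and the result document is assumed to be an in-memory element tree whose nodes carry the xref attribute described above.

```python
import xml.etree.ElementTree as ET


def replace_fragment(result_root, new_fragment):
    """Swap in a re-transformed fragment, matched through its xref attribute."""
    target = new_fragment.get("xref")
    if result_root.get("xref") == target:
        # The whole document was re-transformed.
        return new_fragment
    for parent in result_root.iter():
        for index, child in enumerate(list(parent)):
            if child.get("xref") == target:
                parent[index] = new_fragment
                return result_root
    raise KeyError(f"no fragment with xref={target!r}")


view = ET.fromstring(
    '<mrow xref="r"><mi xref="a">x</mi><mi xref="b">y</mi></mrow>')
updated = ET.fromstring('<mn xref="b">2</mn>')
print(ET.tostring(replace_fragment(view, updated), encoding="unicode"))
```

The linear search is enough for a sketch; an interactive editor would more plausibly keep an index from identifiers to result nodes.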

More generally, back-pointers provide a mechanism for relating the view 
of the document with the source markup. This way it is possible to perform 
operations like selection or cut-and-paste that, while having a visual effect in 
the view, act indirectly at the content/markup level. 

6 Conclusion 

We have presented architectural and implementation issues of an interactive 
editor based on TeX syntax which allows flexible customization and 
content-oriented authoring. TeXmacs^ is probably the existing application that most 
closely adopts such an architecture, with the difference that TeXmacs does not 
stick to TeX syntax as closely as we do and that, apart from being a complete 
(and cumbersome) editing tool and not just an interface, it uses encoding and 
transformation technologies not based on standard languages (XML [2] and 
XSLT [3]). 

Among batch conversion tools we observe a tendency to move towards the 
processing of content. The TeX to MathML converter by Igor Rodionov and 
Stephen Watt at the University of Western Ontario [8, 9] is one such tool, and the 
recent Hermes converter by Romeo Anghelache [10] is another. These represent 
significant steps forward when compared to converters such as LaTeX2HTML^. 

A prototype tool called EdiTeX, based on the architecture described in this 
paper, has been developed and is freely available along with its source code^. No 
mention of MathML is made in the name of the tool, to stress the fact that the 
architecture is very general and can be adapted to other kinds of markup. The 
prototype is currently being used as an interface for a proof-assistant application 
where editing of complex mathematical formulas and proofs is required. In this 
respect we should remark that TeX syntax is natural for “real” mathematics, 
but it quickly becomes clumsy when used for writing terms of programming 
languages or λ-calculus. This is mainly due to the conventions regarding spaces 
(for instance, spaces in the λ-calculus denote function application) and identifiers 
(the rule “one character is one identifier” is fine for mathematics, but not 
for many other languages). Note however that, since the lexical analyzer is 
completely separate from the rest of the architecture, the token stream being 
its interface, it can be easily targeted to a language with different conventions 
than those of TeX. 

^ http://www.texmacs.org/ 

^ http://www.latex2html.org/ 

^ http://helm.cs.unibo.it/software/editex/ 

The idea of using some sort of restricted TeX syntax for representing 
mathematical expressions is not new. For example, John Forkosh’s MimeTeX^ 
generates bitmap images of expressions to be embedded in Web pages. However, to 
the best of our knowledge the formal specification of the parser for simplified 
TeX markup presented in Sect. 4 is the only one of its kind. A straightforward 
implementation based directly on the rules given in Table 2 amounts to just 70 
lines of functional code (in an ML dialect), which can be considered something 
of an achievement given that TeX parsing is normally regarded as a hard task. 
By comparison, the parsing code in MimeTeX amounts to nearly 350 lines of C 
code after stripping away the comments. 

One may argue that the simplified TeX markup is too restrictive, but in our 
view this is just the sensible fragment of TeX syntax that the average user should 
be concerned about. In fact the remaining syntactic expressiveness provided by 
TeX is mainly required for the implementation of complex macros and of system 
internals, which should never surface at the document level. By separating the 
transformation phase we shift the mechanics of macro expansion to a different 
level which can be approached with different (more appropriate) languages. Since 
this mode of operation makes the system more flexible we believe that our 
design is a valuable contribution which may provide an architecture for other 
implementers to adopt. 
Appendix: The TML DTD 

<!ENTITY % TML.node "empty | space | literal | macro">
<!ENTITY % TML.common.attrib "
  id     CDATA    #IMPLIED
  xref   CDATA    #IMPLIED
  start  NMTOKEN  #IMPLIED
  end    NMTOKEN  #IMPLIED">

<!ELEMENT empty EMPTY>
<!ATTLIST empty %TML.common.attrib;>

<!ELEMENT space EMPTY>
<!ATTLIST space
  %TML.common.attrib;
  name     NMTOKEN  #IMPLIED
  literal  CDATA    #IMPLIED>

<!ELEMENT literal (#PCDATA)>
<!ATTLIST literal
  %TML.common.attrib;
  name  NMTOKEN  #IMPLIED>

<!ELEMENT macro (p)*>
<!ATTLIST macro
  %TML.common.attrib;
  name     NMTOKEN  #REQUIRED
  literal  CDATA    #IMPLIED>

<!ELEMENT p (%TML.node;)*>
<!ATTLIST p %TML.common.attrib;>
Abstract. This paper describes how to typeset Chinese, Japanese, and 
Korean (CJK) languages with Omega, a 16-bit extension of Donald 
Knuth’s TeX. In principle, Omega has no difficulty in typesetting those 
East Asian languages because of its internal representation using 16-bit 
Unicode. However, it has not been widely used in practice because of the 
difficulties in adapting it to CJK typesetting rules and fonts, which we 
will discuss in the paper. 



1 Introduction 

Chinese, Japanese, and Korean (CJK) languages are characterized by multibyte 
characters covering more than 60% of Unicode. The huge number of characters 
prevented the original 8-bit TeX from working smoothly with CJK languages. 
There have been three methods for supporting CJK languages in the TeX world 
up to now. 

The first method, called the subfont scheme, splits CJK characters into sets 
of 256 characters or fewer, the number of characters that a TeX font metric file 
can accommodate. Its main advantage lies in using 8-bit TeX systems directly. 
However, one document may contain dozens of subfonts for each CJK font, 
and it is quite hard to insert glue and kerns between characters of different 
subfonts, even those from the same CJK font. Moreover, without the help of 
a DVI driver (e.g., DVIPDFMx [2]) supporting the subfont scheme, it is not 
possible to generate PDF documents containing CJK characters that can be 
extracted or searched. Many packages are based on this method; for instance, 
CJK-LaTeX^1 by Werner Lemberg, HLaTeX^2 by Koaunghi Un, and the Chinese 
module in ConTeXt^3 by Hans Hagen. 

On the other hand, in Japan, the most widely used TeX-based system is 
pTeX [1] (formerly known as ASCII Nihongo TeX), a 16-bit extension of TeX 
localized to the Japanese language. It is designed for high-quality Japanese 
book publishing (the “p” of pTeX stands for publishing; the name jTeX was 
used by another system). pTeX can handle multibyte characters natively (i.e., 
without resorting to subfonts), and it can typeset both horizontally and vertically 
within a document. It is upward compatible^4 with TeX, so it can be used to 
typeset both Japanese and Latin languages, but it cannot handle Chinese and 
Korean languages straightforwardly. pTeX supports three widely-used Japanese 
encodings, JIS (ISO-2022-JP), Shift JIS, and EUC-JP, but not Unicode-based 
encodings such as UTF-8. 

^1 Available on CTAN as language/chinese/CJK/ 

^2 Available on the “Comprehensive TeX Archive Network” (CTAN) as 
language/korean/HLaTeX/ 

^3 Available on CTAN as macros/context/ 

A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 139–148, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 

The third route, Omega [3,4], is also a 16-bit extension of TeX, having 
16-bit Unicode as its internal representation. In principle, Omega is free from the 
limitations mentioned above, but thus far there is no thorough treatment of how 
it can be used for professional CJK typesetting and how to adapt it to popular 
CJK font formats such as TrueType and OpenType. We set out to fill this 
gap. 

2 CJK Typesetting Characteristics 

Each European language has its own hyphenation rules, but their typesetting 
characteristics are overall fairly similar. CJK languages differ from European 
languages in that there are no hyphenation rules. All CJK languages allow 
line breaking almost anywhere, without a hyphen. This characteristic is usually 
implemented by inserting appropriate glues between CJK characters. 

One fine point is the treatment of blank spaces and end-of-line (EOL) 
characters. Korean uses blank spaces to separate words, but Chinese and Japanese 
rarely use blank spaces. An EOL character is converted in TeX to a blank space 
and then to a skip, which is unnecessary for Chinese and Japanese typesetting. To 
overcome this problem, pTeX ignores an EOL when it follows a CJK character. 
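
The EOL rule can be illustrated with a few lines of Python. This is a sketch of the behavior as described here, not of pTeX internals; the helper names and the character ranges used to recognize Chinese and Japanese characters are assumptions made for the example.

```python
def is_cjk(ch):
    """Rough test for Chinese/Japanese characters (ranges are an assumption)."""
    code = ord(ch)
    return (0x3000 <= code <= 0x30FF      # CJK punctuation, hiragana, katakana
            or 0x4E00 <= code <= 0x9FFF)  # CJK unified ideographs


def join_lines(lines):
    """Turn each end of line into a space unless the line ends with a CJK character."""
    pieces = []
    for line in lines:
        pieces.append(line)
        if not (line and is_cjk(line[-1])):
            pieces.append(" ")
    return "".join(pieces)


print(join_lines(["日本語", "TeX", "です"]))
```

The line break after "TeX" still yields a space, while the breaks after the Japanese lines do not.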

Moreover, whereas Korean uses Latin punctuation marks (periods, commas, 
etc.), Chinese and Japanese use their own punctuation symbols. These CJK 
punctuation symbols need to be treated somewhat differently from ordinary 
characters. The appropriate rules are described in this paper. 

3 CJK Omega Translation Process 

We introduce here the CJK Omega Translation Process (CJK-OTP)^5 developed 
by the authors to implement the CJK typesetting characteristics mentioned 
above. 

An Omega Translation Process (OTP) is a powerful preprocessor, which 
allows text to be passed through any number of finite state automata, which can 
achieve many different effects. Usually it is quite hard or impossible to do the 
same work with other TeX-based systems. 

^4 Although pTeX doesn’t actually pass the trip test, it is thought to be upward 
compatible with TeX in virtually all practical situations. 

^5 Available at http://project.ktug.or.kr/omega-cjk/ 

For each CJK language, the CJK-OTP is divided into two parts. The 
first OTP (boundCJK.otp) is common to all CJK languages, and controls the 
boundaries of blocks consisting of CJK characters and blank spaces. The second 
OTP (one of interCHN.otp, interJPN.otp, and interKOR.otp) is specific to each 
language, and controls typesetting rules for consecutive CJK characters. 

4 Common Typesetting Characteristics 

The first task of boundCJK.otp is to split the input stream into CJK blocks 
and non-CJK blocks, and insert glue (\boundCJKglue) in between to allow line 
breaking. 

However, combinations involving some Latin and CJK symbols (quotation 
marks, commas, periods, etc.) do not allow line breaking. In this case, 
\boundCJKglue is not inserted so that the original line breaking rule is applied. This 
corresponds to pTeX’s primitives \xspcode and \inhibitxspcode. 

boundCJK.otp defines seven character sets; the role of each set is as follows. 

1. {CJK} is the set of all CJK characters; its complement is denoted by ~{CJK}. 

2. {XSPCODE1} (e.g., ([{‘) is the subset of ~{CJK} such that \boundCJKglue 
is inserted only between {CJK} and {XSPCODE1} in this order. 

3. {XSPCODE2} (e.g., )]}’;,.) is the subset of ~{CJK} such that 
\boundCJKglue is inserted only between {XSPCODE2} and {CJK} in this order. 

4. {XSPCODE3} (e.g., 0-9 A-Z a-z) is the subset of ~{CJK} such that 
\boundCJKglue is inserted between {CJK} and {XSPCODE3}, irrespective of the 
order. 

5. {INHIBITXSPCODE0} (e.g., —) is the subset of {CJK} not allowing 
\boundCJKglue between {INHIBITXSPCODE0} and ~{CJK}, irrespective of 
the order. 

6. {INHIBITXSPCODE1} (e.g., CJK right parentheses and periods) is the 
subset of {CJK} not allowing \boundCJKglue between ~{CJK} 
and {INHIBITXSPCODE1} in this order. 

7. {INHIBITXSPCODE2} (e.g., CJK left parentheses) is the subset 
of {CJK} not allowing \boundCJKglue between {INHIBITXSPCODE2} 
and ~{CJK} in this order. 
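
The interplay of the seven sets can be summarized as a single decision on each pair of adjacent characters. The Python sketch below is one reading of the rules above, not the OTP source: the members of the sets are small illustrative samples, the test for CJK characters uses assumed code ranges, and it further assumes that a non-CJK character belonging to none of the three XSPCODE sets receives no glue.

```python
import string

XSPCODE1 = set("([{\u2018")
XSPCODE2 = set(")]}\u2019;,.")
XSPCODE3 = set(string.digits + string.ascii_letters)
INHIBIT0 = set("\u2014")              # dash
INHIBIT1 = set("\u300d\uff09\u3002")  # sample CJK right parentheses and period
INHIBIT2 = set("\u300c\uff08")        # sample CJK left parentheses


def is_cjk(ch):
    """Rough membership test for {CJK}; the ranges are an assumption."""
    code = ord(ch)
    return (ch in INHIBIT0 or ch in INHIBIT1 or ch in INHIBIT2
            or 0x3000 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF
            or 0xAC00 <= code <= 0xD7A3 or 0xFF00 <= code <= 0xFFEF)


def glue_between(left, right):
    """Return True if boundCJKglue should separate two adjacent characters."""
    if is_cjk(left) and not is_cjk(right):
        if left in INHIBIT0 or left in INHIBIT2:
            return False
        return right in XSPCODE1 or right in XSPCODE3
    if not is_cjk(left) and is_cjk(right):
        if right in INHIBIT0 or right in INHIBIT1:
            return False
        return left in XSPCODE2 or left in XSPCODE3
    return False  # not a boundary between a CJK and a non-CJK block


print(glue_between("字", "a"), glue_between("(", "字"))
```

A letter next to a CJK character gets the glue on either side, whereas an opening Latin parenthesis followed by a CJK character does not.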

The second task of boundCJK.otp is to enclose each CJK block in a group 
‘{\selectCJKfont␣...}’, and convert all blank spaces inside the block to the 
command \CJKspace. 

The command \selectCJKfont switches to the appropriate CJK font, and 
\CJKspace is defined to be either a \space (for Korean) or \relax (for Chinese 
and Japanese) according to the selected language. 

Note that if the input stream starts with blank spaces followed by a CJK 
block or ends with a CJK block followed by blank spaces, then these spaces 
must be preserved regardless of the language, because of math mode: 

{{CJK} {SPACE} $...$ {SPACE} {CJK}} 

Jin-Hwan Cho and Haruhiko Okumura 

and restricted horizontal mode: 

\hbox{{SPACE} {CJK} {SPACE}} 

5 Language-Dependent Characteristics 

The line breaking mechanism is common to all of the language-dependent OTPs 
(interCHN.otp, interJPN.otp, and interKOR.otp). The glue \interCJKglue is 
inserted between consecutive CJK characters, and its role is similar to the glue 
\boundCJKglue at the boundary of a CJK block. 

Some combinations of CJK characters do not allow line breaking. This is 
implemented by simply inserting a \penalty 10000 before the relevant 
\interCJKglue. In the case of boundCJK.otp, however, no \boundCJKglue is inserted 
where line breaking is inhibited. 

The CJK characters not allowing line breaking are defined by the following 
two classes in interKOR.otp for Korean typesetting. 

1. {CJK_FORBIDDEN_AFTER} does not allow line breaking between 
{CJK_FORBIDDEN_AFTER} and {CJK} in this order. 

2. {CJK_FORBIDDEN_BEFORE} does not allow line breaking between {CJK} 
and {CJK_FORBIDDEN_BEFORE} in this order. 

On the other hand, interJPN.otp defines six classes for Japanese typesetting, as 
discussed in the next section. 
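
The Korean rule can be sketched as follows. This Python fragment is a simplified illustration rather than the contents of interKOR.otp: the function name is hypothetical and the members of the two classes are sample characters chosen for the example.

```python
FORBIDDEN_AFTER = set("\uff08\u300c")         # sample opening brackets
FORBIDDEN_BEFORE = set("\uff09\u300d\u3002")  # sample closing brackets and period


def insert_inter_glue(block):
    """Return the token list for a block of consecutive CJK characters."""
    out = []
    for left, right in zip(block, block[1:]):
        out.append(left)
        if left in FORBIDDEN_AFTER or right in FORBIDDEN_BEFORE:
            # An infinite penalty in front of the glue inhibits the break.
            out.append("\\penalty 10000")
        out.append("\\interCJKglue")
    out.extend(block[-1:])
    return out


print(insert_inter_glue("가\uff08나"))
```

The glue is always emitted, so spacing stays uniform; only the penalty decides whether a line may be broken at that point.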

6 Japanese Typesetting Characteristics 

Most Japanese characters are designed on a square ‘canvas’. pTeX introduced 
a new length unit, zw (for zenkaku width, or full-width), denoting the width of 
this canvas. The CJK-OTP defines \zw to denote the same quantity. 

For horizontal (left-to-right) typesetting mode, the baseline of a Japanese 
character typically divides the square canvas in the ratio 0.88 : 0.12. If Japanese and 
Latin fonts are typeset with the same size, Japanese fonts appear larger. In the 
sample shown in Figure 1, Japanese characters are typeset at 92.469 percent of the 
size of Latin characters, so that 10 pt (1 in = 72.27 pt) Latin characters are mixed 
with 3.25 mm (= 13Q; 4Q = 1 mm) Japanese characters. Also, Japanese and 
Latin words are traditionally separated by about 0.25 zw, though this space is 
getting smaller nowadays. 

Some characters (such as punctuation marks and parentheses) are designed 
on a half-width canvas, whose width is 0.5 zw. For ease of implementation, actual 
glyphs may be designed on square canvases. We can use the virtual font 
mechanism to map the logical shape to the actual implementation. 

Fig. 1. The width of an ordinary Japanese character, 1 zw, is set to 92.469% of the design 
size of the Latin font, and a gap of 0.25 zw is inserted. The baseline is set to 0.12 zw 
above the bottom of the enclosing squares. 

interJPN.otp divides Japanese characters into six classes: 

1. Left parentheses: 

Half width, may be designed on square canvases flush right. In that case we 
ignore the left half and pretend they are half-width, e.g., \hbox to 0.5zw 
{\hss}. If a class-1 character is followed by a class-3 character, then an 
\hskip 0.25zw minus 0.25zw is inserted in between. 

2. Right parentheses: 、 ， ’ ” ） 〕 ］ ｝ 〉 》 」 』 】 

Half width, may be designed flush left on square canvases. If a class-2 
character is followed by a class-0, -1, or -5 character, then an \hskip 0.5zw 
minus 0.5zw is inserted in between. If a class-2 character is followed by a 
class-3 character, then an \hskip 0.25zw minus 0.25zw is inserted in between. 

3. Centered points: ・ ： ； 

Half width, may be designed centered on square canvases. If a class-3 
character is followed by a class-0, -1, -2, -4, or -5 character, then an \hskip 0.25zw 
minus 0.25zw is inserted in between. If a class-3 character is followed by a 
class-3 character, then an \hskip 0.5zw minus 0.25zw is inserted in between. 

4. Periods: 。 ． 

Half width, may be designed flush left on square canvases. If a class-4 
character is followed by a class-0, -1, or -5 character, then an \hskip 0.5zw 
is inserted in between. If a class-4 character is followed by a class-3 character, 
then an \hskip 0.75zw minus 0.25zw is inserted in between. 

5. Leaders: 

Full width. If a class-5 character is followed by a class-1 character, then an 
\hskip 0.5zw minus 0.5zw is inserted in between. If a class-5 character is 
followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is 
inserted in between. If a class-5 character is followed by a class-5 character, 
then a \kern 0zw is inserted in between. 

0. Class-0: everything else. 

Full width. If a class-0 character is followed by a class-1 character, then an 
\hskip 0.5zw minus 0.5zw is inserted in between. If a class-0 character is 
followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is 
inserted in between. 
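
The class rules above amount to a lookup table indexed by the class of a character and the class of its successor. The following Python sketch encodes exactly the pairs listed above; the names SKIPS, add_rule, and spacing are hypothetical, and the output is written with the \zw dimension defined by the CJK-OTP rather than with the zw unit of pTeX.

```python
# (class, following class) -> (natural width, shrink), both in units of zw.
SKIPS = {}


def add_rule(cls, followers, natural, shrink):
    for follower in followers:
        SKIPS[(cls, follower)] = (natural, shrink)


add_rule(1, [3], 0.25, 0.25)
add_rule(2, [0, 1, 5], 0.5, 0.5)
add_rule(2, [3], 0.25, 0.25)
add_rule(3, [0, 1, 2, 4, 5], 0.25, 0.25)
add_rule(3, [3], 0.5, 0.25)
add_rule(4, [0, 1, 5], 0.5, 0.0)
add_rule(4, [3], 0.75, 0.25)
add_rule(5, [1], 0.5, 0.5)
add_rule(5, [3], 0.25, 0.25)
add_rule(0, [1], 0.5, 0.5)
add_rule(0, [3], 0.25, 0.25)


def spacing(cls, following):
    """Return the TeX material inserted between two classes, or None."""
    if (cls, following) == (5, 5):
        return "\\kern 0\\zw"  # consecutive leaders must not be separated
    rule = SKIPS.get((cls, following))
    if rule is None:
        return None
    natural, shrink = rule
    text = f"\\hskip {natural}\\zw"
    if shrink:
        text += f" minus {shrink}\\zw"
    return text


print(spacing(3, 3))
```

Pairs that are not listed, such as a left parenthesis followed by an ordinary character, return None here, meaning that no extra space is described for them in the list.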



Chinese texts can be typeset mostly with the same rules. An exception is the 
comma and the period of Traditional Chinese. These two letters are designed at 
the center of the square canvas, so they should be treated as Class-3 characters. 
7 Example: Japanese and Korean 

Let us discuss how to use CJK-OTP in a practical situation. Figure 2 shows 
sample output containing both Japanese and Korean characters, which is typeset 
by Omega with the CJK-OTP and then processed by DVIPDFMx. 

Fig. 2. Sample CJK-OTP output. 



The source of the sample above was prepared with the text editor Vim as 
shown in Figure 3. Here, the UTF-8 encoding was used to see Japanese and 
Korean characters at the same time. Note that the backslash character (\) is 
replaced with the yen currency symbol in Japanese fonts. 



¥input omega-cjk-sample 
¥hsize=75mm ¥parindent=¥zw 
{¥japanese 
(Japanese sample text) 
} 
¥par¥vskip 10pt 
{¥korean 
(Korean sample text) 
} 
¥bye 

Fig. 3. Sample CJK-OTP source. 



The first line in Figure 3 calls another TeX file, omega-cjk-sample.tex, which 
starts with the following code, which loads^6 the CJK-OTP. 

\ocp\OCPindefault=inutf8 
\ocp\OCPboundCJK=boundCJK 
\ocp\OCPinterJPN=interJPN 
\ocp\OCPinterKOR=interKOR 

^6 Omega requires the binary form of OTP files compiled by the utility otp2ocp included 
in the Omega distribution. 

Note that inutf8.otp has to be loaded first to convert the input stream 
encoded with UTF-8 to UCS-2, the 16-bit Unicode. 
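
What this first conversion step does to the byte stream can be shown in a few lines of Python. This is only an illustration of the encodings involved, not of how inutf8.otp is written; the function name utf8_to_ucs2 is hypothetical.

```python
def utf8_to_ucs2(data):
    """Convert UTF-8 bytes to big-endian UCS-2 (16 bits per character)."""
    text = data.decode("utf-8")
    for ch in text:
        if ord(ch) > 0xFFFF:
            # UCS-2 has no way to represent characters beyond 16 bits.
            raise ValueError(f"U+{ord(ch):X} does not fit in 16 bits")
    return text.encode("utf-16-be")


# The Hangul syllable U+D55C takes three bytes in UTF-8 and two in UCS-2.
print(utf8_to_ucs2(b"\xed\x95\x9c").hex())
```

Every later OTP can then work on fixed-size 16-bit characters, which is why this step has to come before the CJK-specific ones.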

\ocplist\CJKOCP= 
  \addafterocplist 1 \OCPboundCJK 
  \addafterocplist 1 \OCPindefault 
  \nullocplist 
\ocplist\JapaneseOCP= 
  \addbeforeocplist 2 \OCPinterJPN \CJKOCP 
\ocplist\KoreanOCP= 
  \addbeforeocplist 2 \OCPinterKOR \CJKOCP 

The glues \boundCJKglue and \interCJKglue for the CJK line breaking 
mechanism are defined by new skip registers to be changed later according to the 
language selected. 

\newskip\boundCJKskip % defined later 
\def\boundCJKglue{\hskip\boundCJKskip} 

\newskip\interCJKskip % defined later 
\def\interCJKglue{\hskip\interCJKskip} 

Japanese typesetting requires more definitions to support the six classes 
defined in interJPN.otp. 

\newdimen\zw \zw=0.92469em 
\def\halfCJKmidbox#1{\leavevmode% 
  \hbox to .5\zw{\hss #1\hss}} 
\def\halfCJKleftbox#1{\leavevmode% 
  \hbox to .5\zw{#1\hss}} 
\def\halfCJKrightbox#1{\leavevmode% 
  \hbox to .5\zw{\hss #1}} 

Finally, we need the commands \japanese and \korean to select the given 
language. These commands have to include actual manipulation of fonts, glues, 
and spaces. 

\font\defaultJPNfont=omrml 
\def\japanese{% 
  \clearocplists\pushocplist\JapaneseOCP 
  \let\selectCJKfont\defaultJPNfont 
  \let\CJKspace\relax % remove spaces 
  \boundCJKskip=.25em plus .15em minus .06em 
  \interCJKskip=0em plus .1em minus .01em 
} 

\font\defaultKORfont=omliysm 
\def\korean{% 
  \clearocplists\pushocplist\KoreanOCP 
  \let\selectCJKfont\defaultKORfont 
  \let\CJKspace\space % preserve spaces 
  \boundCJKskip=0em plus .02em minus .01em 
  \interCJKskip=0em plus .02em minus .01em 
} 

It is straightforward to extend these macros to create a LaTeX (Λ) class file. 

8 CJK Font Manipulation 

At first glance, the best font for Omega seems to be the one containing all 
characters defined in 16-bit Unicode. In fact, such a font cannot be constructed. 

There are several varieties of Chinese letters: Traditional letters are used in 
Taiwan and Korea, while simplified letters are now used in mainland China. 
Japan has its own somewhat simplified set. The glyphs are significantly different 
from country to country. 

Unicode unifies these four varieties of Chinese letters into one, if they look 
similar. They are not identical, however. For example, the letter ‘bone’ has the 
Unicode point 9AA8, but the top part of the Chinese Simplified letter and the 
Japanese letter are almost mirror images of each other, as shown in Figure 4. 
Less significant differences are also distracting to native Asian readers. The only 
way to overcome this problem is to use different CJK fonts according to the 
language selected. 
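
The consequence for software is that the character code alone cannot determine the glyph; the language has to be carried alongside the text. The Python sketch below illustrates this with the code point mentioned above. The language tags and the function font_for are assumptions made for the example; the font names are the generic serif names listed later in Table 1.

```python
import unicodedata

# Assumed language tags mapped to the generic serif names of Table 1.
SERIF_FONT = {
    "zh-Hans": "STSong-Light",
    "zh-Hant": "MSung-Light",
    "ja": "Ryumin-Light",
    "ko": "HYSMyeongJo-Medium",
}


def font_for(language):
    """Pick the font from the language, since the code point cannot tell."""
    return SERIF_FONT[language]


bone = "\u9aa8"
# One name and one code point, whatever the language of the surrounding text.
print(unicodedata.name(bone))
print(font_for("zh-Hans"), font_for("ja"))
```

This mirrors what the \japanese and \korean commands do when they redefine \selectCJKfont.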




(a) Chinese Simplified (b) Japanese 

Fig. 4. Two letters with the same Unicode point. 



OpenType (including TrueType) is the most popular font format for CJK 
fonts. However, it is neither easy nor simple, even for experts, to generate 
OFM and OVF files from OpenType fonts. 

The situation looks simple for Japanese and Chinese fonts, having fixed 
width, because one (virtual) OFM is sufficient which can be constructed by 
hand. However, Korean fonts have proportional width. Since most of the popular 
Korean fonts are in OpenType format, a utility that extracts font metrics from 
OpenType fonts is required. 
There are two patches of the ttf2tfm and ttf2pk utilities^7 using the freetype 
library. The first^8, written by one of the authors, Jin-Hwan Cho, generates 
OFM and OVF files from TrueType fonts (not OpenType fonts). The other^9, 
written by Won-Kyu Park, lets ttf2tfm and ttf2pk run with OpenType (including 
TrueType) fonts with the help of the freetype2 library. Moreover, the two patches 
can be used together. 

Unfortunately, ovp2ovf 2.0 included in recent TeX distributions (e.g., teTeX 
2.x) does not seem to work correctly, so the previous version 1.x must be used. 



9 Asian Font Packs and DVIPDFMx 

A solution avoiding the problems mentioned above is to use the CJK fonts 
included in the Asian font packs of Adobe (Acrobat) Reader as non-embedded 
fonts when making PDF output. 

It is well known that Adobe Reader can display and print several common 
fonts even if they are not embedded in the document. These are the fourteen base 
Latin fonts, such as Times, Helvetica, and Courier, and several CJK fonts, if 
Asian font packs^10 are installed. These packs have been available free of charge 
since the era of Adobe Acrobat Reader 4. Four are available: Chinese 
Simplified, Chinese Traditional, Japanese, and Korean. Moreover, Adobe Reader 6 
downloads the appropriate font packs on demand when a document containing 
non-embedded CJK characters is opened. Note that these fonts are licensed 
solely for use with Adobe Readers. 

Professional CJK typesetting requires at least two font families: serif and 
sans serif. As of Adobe Acrobat Reader 4, Asian font packs, except for Chinese 
Simplified, included both families, but newer packs include only a serif family. 
However, newer versions of Adobe Reader can automatically substitute a missing 
CJK font by another CJK font installed in the operating system, so displaying 
both families is possible on most platforms. 

If the CJK fonts included in Asian font packs are to be used, there is no need 
to embed the fonts when making PDF output. The PDF file should contain the 
font names and code points only. Some ‘generic’ font names are given in Table 1, 
which can be handled by Acrobat Reader 4 and later. However, these names 
depend on the PDF viewers^11. Note that the names are not necessarily true 
font names. For example, Ryumin-Light and GothicBBB-Medium are the names 
of commercial (rather expensive) Japanese fonts. They are installed in every 
genuine (expensive) Japanese PostScript printer. PDF readers and 
PostScript-compatible low-cost printers accept these names but use compatible typefaces 
instead. 

^7 Available from the FreeType project, http://www.freetype.org. 

^8 Available from the Korean TeX Users group, 
http://ftp.ktug.or.kr/pub/ktug/freetype/contrib/ttf2pk-1.5-20020430.patch. 

^9 Available as 
http://chem.skku.ac.kr/~wkpark/project/ktug/ttf2pk-freetype2.20030314.tgz. 

^10 Asian font packs for Adobe Acrobat Reader 5.x and Adobe Reader 6.0, Windows and 
Unix versions, can be downloaded from 
http://www.adobe.com/products/acrobat/acrrasianfontpack.html. For Mac OS, an 
optional component is provided at the time of download. 

^11 For example, these names are hard coded in the executable file of Adobe (Acrobat) 
Reader, and each version has different names. 



Table 1. Generic CJK font names. 

                      Serif                Sans Serif 
Chinese Simplified    STSong-Light         STHeiti-Regular 
Chinese Traditional   MSung-Light          MHei-Medium 
Japanese              Ryumin-Light         GothicBBB-Medium 
Korean                HYSMyeongJo-Medium   HYGoThic-Medium 


While TeX generates DVI output only, pdfTeX generates both DVI and 
PDF output. But Omega and pTeX do not have counterparts generating PDF 
output yet. One solution is DVIPDFMx [2], an extension of dvipdfm^12, 
developed by Shunsaku Hirata and one of the authors, Jin-Hwan Cho. 

10 Conclusion 

We have shown how Omega, with CJK-OTP, can be used for the production 
of quality PDF documents using the CJK languages. 

CJK-OTP, as it stands, is poorly tested and documented. Especially needed 
are examples of Chinese typesetting, in which the present authors are barely 
qualified. In due course, we hope to upload CJK-OTP to CTAN. 
Abstract. This paper describes a font family designed to meet the 
requirements of typesetting mathematical documents in an Arabic 
presentation. Thus, not only is the text written in an Arabic alphabet-based 
script, but specific symbols are used and mathematical expressions also 
spread out from right to left. Actually, this font family consists of two 
components: an Arabic mathematical font and a dynamic font. The 
construction of this font family is a first step of a project aiming at providing 
a complete and homogeneous Arabic font family, in the OpenType 
format, respecting Arabic calligraphy rules. 

Keywords: Mathematical font, Dynamic font, Variable-sized symbols, 
Arabic mathematical writing, Multilingual documents, Unicode, 
PostScript, and OpenType. 



1 Overview 

The Arabic language is native for roughly three hundred million people living 
in the Middle East and North Africa. Moreover, the Arabic script is used, in 
various slightly extended versions, to write many major languages such as Urdu 
(Pakistan), Persian and Farsi (Iran, India), or other languages such as Berber 
(North Africa), Sindhi (India), Uyghur, Kirgiz (Central Asia), Pashtun 
(Afghanistan), Kurdish, Jawi, Baluchi, and several African languages. A great many 
Arabic mathematical documents are still written by hand. Millions of learners 
are affected in their daily studies by the availability of systems for typesetting 
and structuring mathematics. 

Creating an Arabic font that follows calligraphic rules is a complex artistic 
and technical task, due in no small part to the necessity of complex contextual 
analysis. Arabic letters vary their form according to their position in the word 
and according to the neighboring letters. Vowels and diacritics take their place 
over or under the character, and that is also context dependent. Moreover, the 
kashida, a small flowing curve placed between Arabic characters, is to be 
produced and combined with characters and symbols. The kashida is also used for 
the text justification. The techniques for managing the kashida are similar to 
those that can be used for drawing curvilinear extensible mathematical symbols, 
such as sum, product or limit. 

There are several Arabic font styles. Of course, it is not easy to make 
all existing styles available. The font style Naskh was the first font style adopted for 
computerization and standardization of Arabic typography. So far, only Naskh, 
Koufi, Ruqaa, and to a limited extent Farsi have really been adapted to the 
computer environment. Styles like Diwani or Thuluth, for example, don’t allow 
enough simplification: they have a great variation in character shapes, the 
characters don’t share the same baseline, and so on. Considering all that, we have 
decided to use the Naskh style for our mathematical font. 

The RyDArab [10] system was developed for the purpose of typesetting 
Arabic mathematical expressions, written from right to left, using specific symbols. 
RyDArab is an extension of the TeX system. It runs with K. Lagally’s Arabic 
system ArabTeX [8] or with Y. Haralambous and J. Plaice’s multilingual Ω [6] 
system. The RyDArab system uses characters belonging to the ArabTeX font 
xnsh or to the omsea font of Ω, respectively. Further Arabic alphabetic symbols 
in different shapes can be brought from the font NasX that has been developed, 
for this special purpose, using METAFONT. The RyDArab system also uses 
symbols from Knuth’s Computer Modern family, obtained through adaptation to 
the right-to-left direction of Arabic. 

Since different fonts are in use, it is natural that some heterogeneity will 
appear in mathematical expressions typeset with RyDArab [9]. Symbol sizes, shapes, 
levels of boldness, positions on the baseline will not quite be in harmony. So, we 
undertook building a new font in OpenType format with two main design goals: 
on the one hand, all the symbols will be drawn with harmonious dimensions, 
proportions, boldness, etc., and on the other hand, the font should contain the 
majority of the symbols in use in the scientific and technical writing based on 
an Arabic script. 

Both Arabic texts and mathematical expressions need some additional varia- 
ble-sized symbols. We used the CurExt [11] system to generate such symbols. 
This application was designed to automatically generate curvilinear extensible 
symbols for TeX with the font generator METAFONT. The new extension of 
CurExt does the same with the font generator PostScript. 

While METAFONT generates bitmap fonts and thus remains inside the TeX 
environment, OpenType [14] gives outline and multi-platform fonts. Moreover, 
since Adobe and Microsoft have developed it jointly, OpenType has become a 
standard combining the two technologies TrueType and PostScript. In ad- 
dition, it offers some additional typographic layout possibilities thanks to its 
multi-table feature. 



2 A Mathematical Font 

The design and the implementation of a mathematical font are not easy [5]. It 
becomes harder when it is oriented to Arabic presentation. Nevertheless, 
independent attempts to build an Arabic mathematical font have been undertaken. 
In fact, F. Alhargan [1] has sent us proofs of some Arabic mathematical symbols 
in TrueType format. 

Now we will describe the way we constructed the OpenType Arabic math- 
ematical font RamzArab. The construction of the font started by drawing the 
whole family of characters by hand. This task was performed by a calligrapher. 
Then the proofs were scanned to transform them into vectors. The scanning 
tools alone don’t produce a satisfying result, so once the design is finalized, the 
characters are processed and analyzed using special software to generate the file 
defining the font. 

In Arabic calligraphy, the feather’s quill (kalam) is a flat rectangle. The writer 
holds it so that the largest side makes an angle of approximately 70° with the 
baseline. Except for some variations, this orientation is kept all along the process 
of drawing the character. Furthermore, as Arabic writing goes from right to left, 
some boldness is produced around segments from top left toward the bottom 
right and conversely, segments from top right to the bottom left will rather be 
slim as in Figure 1. 
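
The relation between nib orientation and stroke boldness can be illustrated with a small numerical sketch. It is written in Python purely for illustration and assumes an idealized nib: a straight edge of fixed width whose short side is ignored, held at a constant 70° to the baseline. The function name and the sample stroke directions are assumptions made for this example, not part of the font construction described here.

```python
import math


def stroke_thickness(nib_width, nib_angle_deg, stroke_angle_deg):
    """Width of the mark left by an ideal flat nib dragged along a straight line.

    Both angles are measured counter-clockwise from the baseline.
    The nib is modelled as a line segment, so a stroke parallel to the
    nib edge has zero thickness and a perpendicular stroke has full width.
    """
    delta = math.radians(stroke_angle_deg - nib_angle_deg)
    return nib_width * abs(math.sin(delta))


NIB_ANGLE = 70.0  # degrees, the approximate angle quoted in the text

# Segment drawn from top left toward bottom right (direction about -45 degrees).
bold = stroke_thickness(1.0, NIB_ANGLE, -45.0)
# Segment drawn from top right toward bottom left (direction about 225 degrees).
slim = stroke_thickness(1.0, NIB_ANGLE, 225.0)

print(f"top-left to bottom-right: {bold:.3f}")
print(f"top-right to bottom-left: {slim:.3f}")
```

Under this simplified model the first direction yields roughly twice the thickness of the second, which matches the qualitative behaviour described above.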

The RamzArab font in Figure 4 contains only symbols specific to Arabic math- 
ematics presentation plus some usual symbols found even in text mode. It is 
mainly composed of the following symbols: 

— alphabetic symbols: Arabic letters in various forms, such as characters in 
isolated standard form, isolated double-struck, initial standard, initial with 
tail, initial stretched, and with loop; 

— punctuation marks; 

— digits as used in the Maghreb Arab (North Africa), and as they are in the 
Machreq Arab (Middle East); 

— accents to be combined with alphabetic symbols; 

— ordinary mathematical symbols such as delimiters, arithmetic operators, etc.; 

— mirror images of some symbols such as sum, integral, etc. 

In Arabic mathematics, the order of the alphabetic symbols differs from the 
Arabic alphabetic order. Some problems can appear with the alphabetic symbols 
in their multiple forms. 

Generally, in Arabic mathematical expressions, alphabetic symbols are written 
without dots or diacritics. This helps to avoid confusions with accents. The 
dots can be added whenever they are needed, however. Thus, few distinct 
symbols are left. 

Moreover, some deviation from the general rules will be necessary: in a 
mathematical expression, the isolated form of the letter ALEF can be confused with 
the Machreq Arab digit ONE. The isolated form of the letter HEH can also 
present confusion with the Machreq Arab digit FIVE. The choice of particular 
glyphs to denote these two characters, respectively, will help to avoid such 
confusions. Even though these glyphs are not in conformity with the homogeneity 
of the font style and calligraphic rules, they are widely used in mathematics. 

In the same way, the isolated form of the letter KAF, resulting from the 
combination of two other basic elements, will be replaced by the KAF glyph in 
Ruqaa style. 

For the four letters ALEF, DAL, REH and WAW, the initial and the iso- 
lated forms are the same, and these letters will be withdrawn from the list of 
letters in initial form. On the other hand, instead of a unique cursive element, 
the stretched form of each of the previous letters will result from the combina- 
tion of two elements. It follows that these letters will not be present in the list 
of the letters in the stretched form. 

The stretched form of a letter is obtained by the addition of a MADDA-FATHA 
or ALEF in its final form to the initial form of the letter to be stretched. 
The glyph of LAM-ALEF has a particular ligature that will be added to the list. 
The stretched form of a character is used if there is no confusion with any 
usual function abbreviation (e.g., the one for the sine function). 

The form with tail is obtained starting from the initial form of the letter 
followed by an alternative of the final form of the letter HEH. These two 
forms are not integrated into the font because they can be obtained through a 
simple composition. 

The form with loop is another form of letters with a tail. It is obtained 
through the combination of the final form with a particular curl that differs 
from one letter to another. This form will be integrated into the font because 
it cannot be obtained through a simple composition. 

The following particular glyphs are also in use: 

[glyph samples] 



The elements that are used in the composition of the operators sum, product, 
limit and factorial in a conventional presentation are also added. These 
symbols are extensible. They are stretched according to the covered 
expression, as we will see in the next section. 

Reversed glyphs, with respect to the vertical - and sometimes also to the 
horizontal - axis, as in Figure 1, are taken from the Computer Modern font 
family. For example, there are: 



Other symbols with mirror image forms already in use¹ are not added to 
this font. Of course, Latin and Greek alphabetic symbols can be used in Arabic 
mathematical expressions. In this first phase of the project, we aren’t integrating 
these symbols into the font. They can be brought in from other existing fonts. 

¹ The Bidi Mirrored property of characters used in Unicode. 

Fig. 1. Sum symbol with vertical then horizontal mirror image. 



3 A Dynamic Font 

The composition of variable-sized letters and curvilinear symbols is one of the 
hardest problems in digital typography. In high-quality printed Arabic works, 
justification of the line is performed through using the kashida, a curvilinear 
variable lengthening of letters along the baseline. The composition of curvilinear 
extensible mathematical symbols is another aspect of dynamic fonts. Here, the 
distinction between fixed size symbols and those with variable width, length, or 
with bidimensional variability, according to the mathematical expression covered 
by the symbol, is of great importance. 

Certain systems [11] solve the problem of vertical or horizontal curvilinear 
extensibility through the a priori production of the curvilinear glyphs for certain 
sizes. New compositions are therefore necessary beyond these already available 
sizes. This option doesn’t allow a full consideration of the curvilinearity of letters 
or composed symbols at large sizes. A better approach to get curvilinear letters 
or extensible mathematical symbols consists of parameterizing the composition 
procedure of these symbols. The parameters then give the system the required 
information about the size or the level of extensibility of the symbol to extend. 
As an example, we will deal with the particular case of the opening and closing 
parentheses as vertically extensible curvilinear symbols and with the kashida as a 
horizontally extensible curvilinear symbol. This can be generalized to any other 
extensible symbol. 

The CurExt system was developed to build extensible mathematical symbols 
in a curvilinear way. The previous version of this system was able to produce 
automatically certain dynamic characters, such as parentheses, using METAFONT. 
In this adaptation, we propose to use the Adobe PostScript Type 3 
format [13]. 

The PostScript language defines several types of font, 0, 1, 2, 3, 9, 10, 11, 
14, 32, 42. Each one of these types has its own conventions to represent and to 
organize the font information. The most widely used PostScript font format 
is Type 1. However, a dynamic font needs to be of Type 3 [3]. 

The use of Type 3 loses certain advantages of Type 1. One is the 
possibility of producing hints for when the output device is of low resolution; 
in the case of small glyphs, a purely geometrical treatment can’t prevent 
the heterogeneity of characters. Another lost advantage is the possibility of using 
Adobe Type Manager (ATM) software. These two disadvantages won’t arise in 
our case, since the symbols are generally without descenders or serifs and the 
font is intended to be used with a composition system such as TeX, not directly 
in Windows. 

The PostScript language [7] produces a drawing by building a path. Here, 
a path is a set of segments (lineto) and third degree Bézier curves (curveto). 
The path can be open or closed on its origin (closepath). A path can contain 
several starting points (moveto). Once a path is defined, it can be drawn as a 
line (stroke) or filled with a color (fill). From the graphical point of view, a 
glyph is a procedure defined by the standard operators of PostScript. 

To parameterize the procedure, the form of the glyph has to be examined to 
determine the different parts of the procedure. This analysis allows determining 
exactly what should be parameterized. In the case of an opening or closing 
parenthesis, all the parts of the drawing depend on the size: the width, the length, 
the boldness and the end of the parenthesis completely depend on the size. 
Figure 2 shows the variation of the different parameters of the open parenthesis 
according to the height. We have chosen a horizontally-edged cap with a boldness 
equal to half of the boldness of the parenthesis. The same process is applied to 
the kashida. 
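
To make the idea of a size-parameterized glyph procedure concrete, here is a minimal sketch, in Python for illustration, that emits the PostScript path of an opening parenthesis for a requested height. The linear formulas in `paren_params`, the helper names and the choice of two cubic Bézier curves are all assumptions made for this example; they are not the values or the procedure used by CurExt. Only the cap rule (half of the boldness, with a horizontal edge) is taken from the text.

```python
def paren_params(height):
    """Hypothetical size-dependent parameters, in PostScript points."""
    width = 0.12 * height + 2.0  # how far the curve bulges to the left
    bold = 0.02 * height + 0.6   # thickness at mid-height
    cap = bold / 2.0             # horizontal cap: half the boldness, as in the text
    return width, bold, cap


def control_x(end_x, mid_x):
    """x shared by both control points so that the cubic passes through mid_x at t = 0.5."""
    # With both control points at ctrl_x: x(0.5) = (2 * end_x + 6 * ctrl_x) / 8
    return (8.0 * mid_x - 2.0 * end_x) / 6.0


def open_paren_ps(height):
    """Return PostScript code that fills an opening parenthesis of the given height."""
    width, bold, cap = paren_params(height)
    outer = control_x(width, 0.0)         # outer edge touches x = 0 at mid-height
    inner = control_x(width + cap, bold)  # inner edge lies `bold` to the right there
    y1, y2 = height / 3.0, 2.0 * height / 3.0
    lines = [
        "newpath",
        f"{width:.3f} {height:.3f} moveto",
        f"{outer:.3f} {y2:.3f} {outer:.3f} {y1:.3f} {width:.3f} 0 curveto",
        f"{width + cap:.3f} 0 lineto",  # horizontal cap at the bottom
        f"{inner:.3f} {y1:.3f} {inner:.3f} {y2:.3f} {width + cap:.3f} {height:.3f} curveto",
        "closepath",                    # horizontal cap at the top
        "fill",
    ]
    return "\n".join(lines)


print(open_paren_ps(100.0))
```

Because every coordinate is derived from the height, asking for a taller parenthesis changes the width, the boldness and the caps together instead of merely scaling a fixed outline.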




Fig. 2. Parametrization of dynamic parenthesis. 

Fig. 3. Generation of dynamic parentheses. 



Producing a dynamic parenthesis such as that in Figure 3 follows these steps: 

— collecting the various needed sizes in a parameter file par; 

— generating a file pl with the local tool par2pl starting from the par file; 

— converting the file pl into a metric file tfm with the application pltotf; 

— compiling the document to generate a dvi file; 

— converting the file from dvi to ps format. 

This process should be repeated as many times as needed to resolve overlapping 
of extensible symbols. 
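
The cycle above can be sketched as a small driver script, shown here in Python for illustration. The file base names and the argument order passed to the local tool par2pl are assumptions, since its interface is not documented here; pltotf, tex and dvips are called with their usual arguments. The number of passes is likewise a placeholder.

```python
import subprocess


def build_commands(font, doc):
    """Return the command lines for one pass of the cycle described above.

    `font` is the base name of the dynamic font files and `doc` that of
    the document; both names are hypothetical.
    """
    return [
        ["par2pl", f"{font}.par", f"{font}.pl"],    # local tool, interface assumed
        ["pltotf", f"{font}.pl", f"{font}.tfm"],    # property list -> TeX font metrics
        ["tex", f"{doc}.tex"],                      # produces doc.dvi
        ["dvips", f"{doc}.dvi", "-o", f"{doc}.ps"],  # dvi -> PostScript
    ]


def run_cycle(font, doc, passes=2):
    """Repeat the whole cycle; more passes may be needed if symbols still overlap."""
    for _ in range(passes):
        for command in build_commands(font, doc):
            subprocess.run(command, check=True)


for command in build_commands("dynparen", "paper"):
    print(" ".join(command))
```

Only `build_commands` is exercised in the loop at the end; `run_cycle` would actually invoke the tools and therefore needs them installed.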

The curvilinear parentheses are produced by CurExt as follows: 

$\parentheses{ 
\matrix{1 & 2 & 3\cr 
4 & 5 & 6\cr 
7 & 8 & 9\cr 
0 & 1 & 2\cr} 
}$ 

instead of the straight parentheses given by the usual encoding in TeX: 

$\left( 
\matrix{1 & 2 & 3\cr 
4 & 5 & 6\cr 
7 & 8 & 9\cr 
0 & 1 & 2\cr} 
\right)$ 

Fig. 4. RamzArab Arabic mathematical font. 



In the same way, we get the curvilinear kashida with CurExt: 



\amarabmath 
${\csum_{b=T-1}^{s}}$ 

instead of the straight lengthened one obtained by RyDArab: 

\amarabmath 
${\lsum_{b=T-1}^{s}}$ 



We can stretch Arabic letters in a curvilinear way through the kashida by 
CurExt: 






4 Conclusion 

The main constraints observed in this work were: 

— a close observation of the Arabic calligraphy rules, in the Naskh style, toward 
their formalization. It will be noticed, though, that we are still far from 
meeting all the requirements of Arabic calligraphy; 

— the heavy use of some digital typography tools, rules and techniques. 

RamzArab, the Arabic mathematical font in Naskh style, is currently 
available as an OpenType font. It meets the requirements of: 

— homogeneity: symbols are designed with the same nib. Thus, their shapes, 
sizes, boldness and other attributes are homogeneous; 

— completeness: it contains most of the usual specific Arabic symbols in use. 

These symbols are about to be submitted for inclusion in the Unicode stan- 
dard. This font is under test for Arabic mathematical e-documents [12] after 
having been structured for Unicode [2,4]. 

The dynamic component of the font also works in PostScript under CurExt 
for some symbols such as the open and close parenthesis and the kashida. That 
will be easily generalized to other variable-sized symbols. The same adaptation 
can be performed within the OpenType format. 

Abstract. What problems do e-documents with mathematical expressions 
in an Arabic presentation present? In addition to the known difficulties 
of handling mathematical expressions based on Latin script on 
the Web, Arabic mathematical expressions flow from right to left and 
use specific symbols with a dynamic cursivity. How might we extend 
the capabilities of tools such as MathML in order to structure Arabic 
mathematical e-documents? Those are the questions this paper will deal 
with. It gives a brief description of some steps toward an extension of 
MathML to mathematics in Arabic exposition. In order to evaluate it, 
this extension has been implemented in Mozilla. 

Keywords: Mathematical expressions, Arabic mathematical presenta- 
tion, Multilingual documents, e-documents, Unicode, MathML, Mozilla. 



1 Overview 

It is well known that HTML authoring capabilities are limited. For instance, 
mathematics is difficult to search and Web formatting is poor. For years, most 
mathematics on the Web consisted of texts with scientific notation rendered as 
images. Image-based equations are generally harder to see, read and comprehend 
than the surrounding text in the browser window. Moreover, the large size of 
this kind of e-document can represent a serious problem. These problems become 
worse when the document is printed. For instance, the resolution of the equations 
will be around 72 dots per inch, while the surrounding text will typically be 300 
or more dots per inch. In addition to the display problems, there are encoding 
difficulties. Mathematical objects can neither be searched nor exchanged between 
software systems nor cut and pasted for use in different contexts nor verified as 
being mathematically correct. As mathematical e-documents may have to be 
converted to and from other mathematical formats, they need encoding with 
respect to both the mathematical notation and mathematical meaning. 

The mathematical markup language MathML [14] offers good solutions to 
the previous problems. MathML is an XML application for describing 
mathematical notation and capturing both its structure, for high-quality visual display, 
and its content, for more semantic applications like scientific software. XML stands 
for eXtensible Markup Language. It is designed as a simplified version of the 
meta-language SGML used, for example, to define the grammar and syntax 
of HTML. One of the goals of XML is to be suitable for use on the Web by 
separating the presentation from the content. At the same time, XML grammar 
and syntax rules carefully enforce document structure to facilitate automatic 
processing and maintenance of large document collections. 

MathML enables mathematics to be served, received, and processed on the 
web, just as HTML has enabled this functionality for text. MathML elements 
can be included in XHTML documents with namespaces and links can be 
associated to any mathematical expression through XLink. Of course, there 
are complementary tools. For instance, the project OpenMath [12] also aims 
at encoding the semantics of mathematics without being in competition with 
MathML. 

Now, what about some of the MathML internationalization aspects - say, 
for instance, its ability to structure and produce e-documents based on non-Latin 
alphabets, such as mathematical documents in Arabic? 

2 Arabic Mathematical Presentation 

Arabic script is cursive. Small curves and ligatures join adjacent letters in a 
word. The shapes of most of the letters are context-dependent; that is, they 
change according to their position in the word. Certain letters have up to four 
different shapes. 

Although some mathematical documents using Arabic-based writing display 
mathematics in Latin characters, in general, not only the text is encoded with 
the Arabic script but mathematical objects and expressions are also encoded 
with special symbols flowing from right to left according to the Arabic writing. 
Moreover, some of these symbols are extensible. 

Mathematical expressions are for the most part handwritten and introduced 
as images. A highly-evolved system of calligraphic rules governs Arabic hand- 
writing. Though Arabic mathematical documents written by hand are sometimes 
of fair quality, the mere presentation of scientific documents is no longer enough, 
since there is a need for searchability, using them in software and so on. 

The RyDArab [8] system makes it possible to compose Arabic mathematical 
expressions of high typographical quality. RyDArab complements TeX for 
typesetting Arabic mathematical documents. RyDArab uses the Computer Modern 
fonts and those of Ω [4] or ArabTeX [7]. The output is DVI, PS, PDF or 
HTML with mathematical expressions as images. The RyDArab [2] system does 
not replace or modify the functionality of the TeX engine, so it does not restrict in 
any way the set of macros used for authoring. Automatic translation from and to 
Latin-based expressions is provided beginning with the latest RyDArab version. 
Will this be enough to structure and typeset e-documents with mathematics 
even when they are based on an alternative script? Starting from this material 
with TeX and Ω, will MathML be able to handle Arabic mathematics? 

3 MathML and Arabic Mathematics 

Of course, semantically speaking, an Arabic mathematical expression is the same 
as a Latin-based one. Thus, only display problems need be taken into account. 
In any case, encoding semantics is beyond the scope of this paper. 

In order to know if there really is a need to construct a new tool or only 
to improve an already available one, what are the possibilities offered by the 
known MathML renderers? As much of the work is built around TeX, an open 
source community effort, it is hard to be precise about the current status of all 
TeX/MathML related projects. Most of these projects belong to one of three 
basic categories: 

— Conversions from TeX to MathML. Of particular note here are Ω [5, 6] and 
TeX4ht [13], a highly specialized editor/DVI driver. Both of these systems 
are capable of writing presentation MathML from TeX documents. There 
are other converters such as LaTeX2HTML and tralics [1]. 

— Conversions from MathML to TeX. The conversion from MathML to TeX 
can be done, for instance, through reading MathML into Mathematica or 
other similar tools and then saving the result back out as TeX, or using 
Scientific Workplace for suitable LaTeX sources. The ConTeXt system is 
another example. 

— Direct typesetting of MathML using TeX. 

Currently, MathML is supported by many applications. This fact shows not 
only that it is the format of choice for publishing equations on the web but also 
that it is a universal interchange format for mathematics. More than twenty 
implementations are listed on the MathML official website, showing that all 
categories of mathematical software can handle MathML. Actually, 

— most mathematical software, such as Scientific Workplace, Maple, MathCad 
and Mathematica, can export and import MathML; 

— all common browsers can display MathML equations either natively or using 
plug-ins; 

— editors such as MathType, Amaya, TeXmacs, and WebEQ support MathML. 

Once non-free or non-open-source tools are omitted, two Web browsers re- 
main: the well-known Mozilla system [11] and Amaya. The W3C’s Amaya edi- 
tor/browser allows authors to include mathematical expressions in Web pages, 
following the MathML specification. Mathematical expressions are handled as 
structured components, in the same way and in the same environment as HTML 
elements. All editing commands provided by Amaya for handling text are also 
available for mathematics, and there are some additional controls to enter and 
edit mathematical constructs. Amaya shows how other W3C specifications can 
be used in conjunction with MathML. 

In the end, we chose to adapt Mozilla to the needs of the situation, mainly 
because of its popularity and widespread adoption as well as the existence of 
an Arabic version. The layout of mathematical expressions in Latin writing, and 
consequently that of the mathematical documents in Mozilla is more elegant and 
of good typographical quality compared to other systems. 

For this implementation, we used the Mozilla 1.5 C++ source under Linux. 
Until now, there was no Mozilla version with support for bidirectionality or 
cursivity in a mathematical environment. In math mode, only left-to-right 
arrangement is supported. Thus, the first step is to find out how to get 
bidirectionality and cursivity inside a MathML passage. 

In fact, adding the property of bidirectionality to MathML elements is a 
delicate task. It requires a careful study of the various possible conflicts. The 
bidirectionality algorithm for mathematical expressions is probably different 
from that originally in use for text. 

Now, let us have a look at what would happen if the bidirectionality algorithm 
for HTML were used for MathML elements. 

The MathML expression 

<mn>1</mn> 
<mo>+</mo> 
<mo>-</mo> 
<mn>2</mn> 

will be rendered as 1 + 2 − v instead of the expected equation: 1 − 2 . 

Since XML supports Unicode, we might expect that the introduction of 
Arabic text into MathML encoding would go without any problem. In other 
words, the Arabic text would be rendered from right to left, and letters would 
be connected just as they should be in their cursive writing. Will the use of the 
element <mtext> (similar to the use of the TeX command \hbox) be enough to 
get a satisfactory rendering of Arabic? 

The following Arabic text is a sample of what is obtained with <mtext> in 
Mozilla: 

<mtext> 

</mtext> 






The following Arabic abbreviation of the cosine function is an example of 
what we get if we introduce it with <mi>: 






In order to allow the arrangement of sub-expressions from right to left in a 
given mother expression, a new element denoted <rl> is introduced¹. 

¹ The name rl reminds us of the initials of right-to-left. Furthermore, because of the 
expected heavy use of this element, its name should be as short as possible. 

<mrow> 
  <rl> 
    <mi>o</mi> 
    <mo>+</mo> 
    <mi>jj</mi> 
  </rl> 
</mrow> 
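
The effect of such a right-to-left element on the order of sub-expressions can be mimicked with a toy model, given here in Python for illustration. It is a sketch under simplifying assumptions, not the Mozilla implementation: the children of an rl element are laid out in reverse, every other element keeps its source order, and the Latin letters a and b stand in for Arabic letters. The function and constant names are hypothetical.

```python
import xml.etree.ElementTree as ET

TOKEN_TAGS = {"mi", "mn", "mo", "mtext"}


def visual_order(element):
    """Return the token texts of `element` in left-to-right visual order."""
    if element.tag in TOKEN_TAGS:
        # A token is atomic here; any markup nested inside it is flattened.
        return ["".join(element.itertext()).strip()]
    children = list(element)
    if element.tag == "rl":
        children.reverse()  # lay the sub-expressions out from right to left
    tokens = []
    for child in children:
        tokens.extend(visual_order(child))
    return tokens


expr = ET.fromstring("<mrow><rl><mi>a</mi><mo>+</mo><mi>b</mi></rl></mrow>")
print(visual_order(expr))  # ['b', '+', 'a']
```

In this model the reversal applies only to the direct children of the element, which is why nested rows need their own right-to-left wrapper.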

The use of the element <rl> also allows solving the previous problem of 
introducing Arabic text in a mathematical expression. 

<mtext> 

</mtext> 

<mi><rl>lx></rl></mi> 

The element <rl> can be used to transform some mathematical objects, such as 
left/right or open/close parentheses, into their mirror image. 

<rl> 
  <mo><rl>[</rl></mo> 
  <mi>^</mi> 
  <mo>,</mo> 
  <mn>3</mn> 
  <mo><rl>)</rl></mo> 
</rl> 

We can remark here that the comma symbol has not changed to its mirror 
image. The result is the same even when the comma is governed by <rl> (i.e., 
<rl>,</rl>). This symbol is not yet mentioned in the Bidi Mirroring list in the 
Unicode Character Database. 
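
Whether a character carries the Bidi Mirrored property can be checked programmatically. The short sketch below uses Python's standard unicodedata module for illustration; the helper name is an assumption, and the answers reflect whichever version of the Unicode Character Database ships with the interpreter.

```python
import unicodedata


def is_bidi_mirrored(char):
    """True if the Unicode Character Database marks `char` as Bidi_Mirrored."""
    return unicodedata.mirrored(char) == 1


# Parenthesis, bracket, comma, n-ary summation, square root.
for ch in "([,\u2211\u221a":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {is_bidi_mirrored(ch)}")
```

The parentheses and the summation sign are reported as mirrored while the comma is not, which is consistent with the behaviour observed above.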

Particular arrangement of the arguments is made necessary by the MathML 
renderer for any presentation elements requiring more than one argument. On 
the other hand, elements of vertical arrangement such as <mfrac> do not need 
special handling. 

<mfrac> 
  <rl> 
    <mi>o</mi> 
    <mo>+</mo> 
    <mi>jj«</mi> 
  </rl> 
  <mn>3</mn> 
</mfrac> 

Although the addition of <rl> helps to get Arabic text rendered as expected 
and to solve arrangement of sub-expressions within an expression, for certain 
elements, it does not work. 

Using the element <rl> to get a superscript element <msup> or a subscript 
element <msub>, in the suitable right to left positions, generates a syntax error 
because <msup> requires two arguments, whereas there is only one argument, as 
can be seen in the following example, which is rendered as “invalid-markup”: 

<msup> 
  <rl> 
    <mi>^</mi> 
    <mi>v^</mi> 
  </rl> 
</msup> 



In this case, we introduce a new markup element <amsup>². It changes the 
direction of rendering expressions while keeping the size of superscripts as it is 
with <msup>. 

<amsup> 
  <mi>^</mi> 
  <mi>vj</mi> 
</amsup> 



The same principle is applied to other elements like <msub>. The notation of 
the arrangement in the Arabic combination analysis is different from its Latin 
equivalent. 

<amarrange> 
  <mi>J</mi> 
  <mn>5</mn> 
  <mn>2</mn> 
</amarrange> 



The next step is related to the shape of some symbols. In Arabic mathematical 
presentation, certain symbols, such as the square root symbol or the sum, in some 
Arabic areas, are built through a symmetric reflection of the corresponding Latin 
ones. These symbols require first the introduction of a new font family such as 
the one offered in the Arabic Computer Modern fonts. This family corresponds 
to the Computer Modern fonts with a mirror effect on some glyphs. In the 
same way that the Computer Modern fonts are used in the Latin mathematical 
environment with <math>, the new element <amath> will allow the use of the 
Arabic Computer Modern fonts in the Arabic mathematical environment. The 
element <amath> would not be necessary if the Arabic mathematical symbols 
were already added in the Unicode tables. In fact, we use the same entity name 
and code for some symbols and their mirror images used in the Arabic 
presentation. For example, the Unicode name N-ARY SUMMATION coded by U+02211 
is associated simultaneously to the Latin sum³ symbol and to its Arabic 
equivalent mirror. Thus, to specify which glyph, and consequently which font, 
is called, the introduction of a new element <amath> is necessary. This element 
would not be necessary if the symbols were denoted with two different entity 
names and consequently two different codes. 

² The following new elements defined in this system are prefixed with the initial of 
Arabic “a”. 

<amath> 
  <rl> 
    <mstyle displaystyle="true"> 
      <munderover> 
        <mo>&sum;</mo> 
        <mrow> 
          <rl> 
            <mi>^</mi> 
            <mo>=</mo> 
            <mn>1</mn> 
          </rl> 
        </mrow> 
        <mi>c_3</mi> 
      </munderover> 
    </mstyle> 
  </rl> 
</amath> 

In order to distinguish alphabetical symbols, in different shapes, from letters 
used in Arabic texts, and to avoid the heterogeneity resulting from the use of 
several fonts, there is a need for a complete Arabic mathematical font. That’s 
exactly what we are trying to do in another project discussed elsewhere in this 
volume [10]. While waiting for their adoption by Unicode, the symbols in use 
in this font will be located in the Private Use Area E000-F8FF in the Basic 
Multilingual Plane. 

<mi>&#xE004;</mi> 
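
As a quick sanity check on such code points, the following sketch, in Python for illustration, tests whether a value lies in the Private Use Area of the Basic Multilingual Plane and formats the corresponding numeric character reference. The helper names are assumptions introduced for this example.

```python
import unicodedata

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area of the BMP


def in_bmp_private_use(code_point):
    """True if the code point lies in the BMP Private Use Area."""
    return PUA_FIRST <= code_point <= PUA_LAST


def char_ref(code_point):
    """Numeric character reference usable inside a MathML token element."""
    return f"&#x{code_point:04X};"


print(in_bmp_private_use(0xE004), char_ref(0xE004))
print(unicodedata.category(chr(0xE004)))  # general category of private-use characters
```

A font-specific assignment such as this one is only meaningful to software that uses the same font, which is why adoption by Unicode is the long-term goal stated above.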

The use of the Arabic Computer Modern fonts is not enough for composed 
symbols. For example, the square root symbol is composed of the square root 
glyph supplemented by an over bar. This over bar is added by the renderer, 
which, thanks to a calculation of the width of the base, gives the length of this 
over bar. 

³ Introduced as &sum; or &#x2211;. 

In this case, neither the inversion of the glyph nor the use of the right-to-left 
element <rl> changes the direction of the visual rendering of the square root. 
For this reason we have introduced a new element (<amsqrt>), which uses the 
square root glyph from the Arabic Computer Modern font that shows the over 
bar to its left. 

<amath> 
  <amsqrt> 
    <mfrac> 
      <rl> 
        <mi>i.j</mi> 
        <mo>+</mo> 
      </rl> 
      <mn>3</mn> 
    </mfrac> 
  </amsqrt> 
</amath> 




The root element <amroot> requires a treatment similar to that of the square 
root combined with the positioning of the index on the right of the baseline. 

<amath> 
  <amroot> 
    <mn>3</mn> 
  </amroot> 
</amath> 




For the elements <munderover>, <munder>, <mover>, and <msubsup>, italic 
correction needs to be done. In fact, mathematical symbols like the integral 
are slanted and the indices and exponents are shifted in the direction of the 
symbol’s slant. This fact appears clearly in the following example representing 
two integrals, while using <amsubsup> in the first: 

<amath> 
  <rl> 
    <mstyle displaystyle="true"> 
      <amsubsup> 
        <mo>&int;</mo> 
        <mn>0</mn> 
        <mn>1</mn> 
      </amsubsup> 
    </mstyle> 
    <amsup> 
      <mi>^</mi> 
      <mi>v^</mi> 
    </amsup> 
    <mi>»</mi> 
    <mi>^</mi> 
  </rl> 
</amath> 

or <amunderover> in the second: 

<amath> 
  <rl> 
    <mstyle displaystyle="true"> 
      <amunderover> 
        <mo>&int;</mo> 
        <mn>0</mn> 
        <mn>1</mn> 
      </amunderover> 
    </mstyle> 
    <amsup> 
      <mi>^</mi> 
      <mi>v^</mi> 
    </amsup> 
    <mi>»</mi> 
    <mi>^</mi> 
  </rl> 
</amath> 



For the limit of an expression, manual lengthening of the limit symbol is 
performed. Of course, dynamic lengthening via automatic calculation of the 
width of the text under the limit sign would be better. 

A lengthening of the straight line is not in conformity with the rules of 
Arabic typography. A curvilinear lengthening is required, which can be obtained 
by using CurExt [9], which makes it possible to stretch Arabic letters according 
to calligraphic rules. 

The following mathematical expression is an example of the use of <mover>, 
with automatic lengthening of the over arrow. 


In fact, the use of element <rl> doesn’t represent a very practical solution 
as the encoding becomes heavier. The addition of this element must be trans- 
parent for the user; the same for all other new elements since they affect only 
the presentation and not the semantics of expression. An alternative solution 
consists of either building a new algorithm of bidirectionality for mathematics, 
or of adding attributes that will make it possible to choose the mathematical 
notation of the expression. We intend to use a new attribute nota for the root 
element <math>. It would indicate whether Arabic or Latin is used inside the 
mathematical expression. As the layout of a mathematical expression follows a 
precise logic, the direction of writing would be handled automatically without 
requiring the use of direction attributes for each child of the element <math>. 

The FIGUE [3] system is an engine for the interactive rendering of structured 
objects. It allows the rendering of an Arabic text from right to left including some 
Latin mathematical expressions flowing from left to right thanks to a proposed 
bidirectional extension of MathML. 

4 Conclusion 

Our goal was to identify the difficulties and limitations that might obstruct the 
use of MathML for writing mathematics in Arabic. The main adaptation we 
made to MathML for Arabic mathematics was the addition of the element <rl> 
that allows: 

— writing mathematical expressions from right-to-left; 

— the use of specific symbols thanks to the modification of other elements;

— and handling the cursivity of writing. 

Now, Arabic mathematical e-documents can be structured and published on 
the Web using this extended version of Mozilla. Such documents can thus benefit 
from all the advantages of using MathML. Our project for the development of 
communication and publication tools for scientific and technical e-documents in 
Arabic is still at its beginning. We hope that the proposals contained in this 
paper will help to find suitable recommendations for Arabic mathematics in 
Unicode and MathML. 
Abstract. Ten years of experience with publishing the GUST bulletin
show that Knuth’s dream of highly portable TeX files is apparently an
illusion in practice. Over the last decade, articles in the GUST
bulletin have used at least six major formats (LaTeX 2.09, transitional
LaTeX+NFSS, LaTeX2e, plain-based TUGboat, Eplain, and ConTeXt),
numerous macro packages, fonts, and graphic formats. Many old articles
are typeset differently nowadays, and some even cause TeX errors.

This situation motivates the following question: how do we avoid the 
same problem in the future? As the World Wide Web is quickly becoming 
the mainstream both of publishing and of information exchange we argue 
for migration to XML - a Web compatible data format. 

In the paper we examine a possible strategy for storing GUST articles
in a custom XML format and publishing them with both TeX and
XSLT/FO. Finally, the problems of converting the TeX files to XML
and the possibility of using TeX4ht - an authoring system for producing
hypertext - are discussed.



1 Introduction 

The dominant role played by the Web in information exchange in modern times 
has motivated publishers to make printed documents widely available on the 
Internet. It is now common that many publications are available on the Web 
only, or before they are printed on paper. Articles published in the GUST 
bulletin are available on the Web in POSTSCRIPT and PDF. Unfortunately, 
these formats decrease document accessibility, searching and indexing by Web 
search engines. For broad accessibility to automated services, it is better to 
use XML as the format of such data. However, one issue with XML is that it
is difficult to maintain the high quality presentation of TeX documents. This
is caused by incompatibilities between browsers and incomplete or immature
implementations of W3C standards.

We are optimistic that these issues will disappear in the near future, and 
believe that XML will become pervasive in the online environment. However, 
in our context, a key to the adoption of XML is the degree to which it can be
integrated with existing TeXnologies.

In this paper we examine one strategy for storing GUST articles in a custom
XML format and publishing them with both TeX and XSLT/FO. Also, the
problems of converting the existing TeX files to XML and the possibility of
using TeX4ht - an authoring system for producing hypertext - are discussed.

2 TeX/LaTeX and Other Document Formats

When the authors started work with TeX (many years ago), the only choice
for publishing significant documents was between closed-source applications
based on proprietary formats and TeX. Nowadays, the choice is much wider, as
XML-based solutions are based on open standards and supported by a huge 
number of free applications. We do not need to write the tools ourselves. Thus 
the strategy of reusing what is publicly available is key in our migration plan. 

On the other hand, it would be unwise to switch to XML as the only
acceptable submission format, because it would force many authors to abandon
the powerful TeX-based editing environments to which they are accustomed,
just to submit texts to our bulletin. Following this route, we would more likely
end up with a shortage of submissions. Thus, we are preparing a mixed strategy
with both TeX and XML as accepted formats. Papers submitted in LaTeX will
ultimately be converted to XML as an archival or retrieval format. Presentation
formats will be XHTML, with corresponding PDF generated by a variety of
tools. The work-flow of documents in this proposed framework is depicted in
Fig. 1.

The XML implementation project described in the paper can be broadly 
divided into the following subtasks: DTD development, formatting development, 
and legacy information conversion [19]. We’ll now describe these stages in detail. 

3 DTD Development Considerations 

There is no doubt (see for example [14,19]) that the DTD development phase 
is of critical importance for the overall success of any SGML/XML project. 
Fortunately, thanks to the great interest in XML technology in recent years, 
there are several production-quality publicly available DTDs which could be 
adapted for our project. To make this choice, we preferred those which are widely 
used and for which formatting applications and tools are available. The following 
possible schemes were considered: 

— DocBook [21], a complete publishing framework, i.e., schemes plus XSLT or
DSSSL stylesheets for conversion to presentation formats; actively developed
and maintained; the de facto standard of many Open Source projects; widely
known and used.

— TEI [1], another complete, actively developed publishing framework. Not as
popular as DocBook, used mainly in academia.

— The LaTeX-based DTD developed in [7] (hereafter referred to as the LWC DTD).
The similarity to the structure of LaTeX is an advantage of this DTD for our
project.

— Others, such as the DTD for GCA/Extreme conferences, X-DiMi from the
Electronic Thesis and Dissertations Project, and the LaTeX-like PMLA
developed by one of the authors [15].

Fig. 1. Processing diagram for XML/LaTeX documents.

Industry-standard DTDs tend to be too big, too complex, and too general for
practical use in specific cases (cf. [14, p. 29]). In particular, the DocBook and
TEI DTDs seem to be too complex for marking-up documents conforming to
LaTeX format.

As a result, users frequently employ the technique of using different DTDs
at different stages of the editorial process. Following Maler [14], we will call the
DTD common to a group of users within an interest group a reference DTD,
and a DTD used solely for editing purposes an authoring DTD. Translation
from one DTD to another may be easily performed with an XSLT stylesheet.

We decided to use a simplified LWC DTD as the authoring DTD and DocBook
as the reference DTD. Providing a simple DTD should expand the group of
prospective authors. For example, many GUST members are experts in
typography or Web design but not necessarily TeX hackers.

The simplification consists of restricting the document hierarchy only to
article-like documents, and removing back matter tags (index, glossary) and 
all presentation tags (newline, hspace, etc.). Also, the optional status of meta- 
data, for example the title, abstract, keywords tags, was changed to required. 
The resulting DTD contains 45 elements compared to 64 in the original one. 

For better maintainability, we rewrote our version of the LWC DTD in RNC
syntax. The RNC schema language was introduced by Clark [6] and recently
adopted as an ISO standard. It has many advantages over DTD or W3C Schema
syntax, namely simplicity and an included documentation facility¹.

As the structure of our documents is not particularly complex, it may be fea- 
sible to develop several authoring DTDs targeted at different groups of authors, 
for example one for those preferring ConTeXt-like documents, another for those
used to GCA conference markup, etc., and then map those documents to the 
reference DTD with XSLT. 
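
The mapping from an authoring DTD to the reference DTD is, at its core, an
element-renaming pass. The paper performs it with an XSLT stylesheet; the
sketch below uses Python's standard xml.etree.ElementTree instead, purely as
an illustration. The authoring-side element names (par, emph, and so on) and
the mapping table are assumptions made for this example and are not taken
from the actual LWC DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical authoring-to-reference element mapping (NOT the real LWC DTD).
AUTHORING_TO_DOCBOOK = {
    "article": "article",
    "title": "title",
    "section": "section",
    "par": "para",
    "emph": "emphasis",
}


def to_reference(elem):
    """Return a copy of elem with authoring tags renamed to reference tags."""
    new_tag = AUTHORING_TO_DOCBOOK.get(elem.tag)
    if new_tag is None:
        # An unmapped element signals a gap in the mapping table.
        raise ValueError(f"no mapping for element <{elem.tag}>")
    out = ET.Element(new_tag, dict(elem.attrib))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_reference(child))
    return out


def convert(authoring_xml):
    """Parse an authoring document and serialize its reference form."""
    root = ET.fromstring(authoring_xml)
    return ET.tostring(to_reference(root), encoding="unicode")


SAMPLE = (
    "<article><title>Hello</title>"
    "<section><title>Intro</title>"
    "<par>Some <emph>marked</emph> text.</par></section></article>"
)
```

Raising an error on unmapped elements, rather than copying them through,
keeps the output valid against the reference DTD; each additional authoring
DTD would then need only its own mapping table.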



4 Formatting with XSLT 

For presentation, LWC documents are first transformed to DocBook with a 
simple XSLT stylesheet. 

The DocBook XSL stylesheets [22] translate an XML document to XHTML 
or FO [18]. As they are written in a modular fashion, they can be easily cus- 
tomized and localized. To publish XHTML from XML documents, an XSLT 
engine is needed such as Kay’s saxon [11] or Veillard’s xsltproc [20]. 

For hard copy output, a two-step process is used. First, the XSLT engine
produces formatting objects (FO), which then must be processed with a
formatting object processor for PDF output². The detailed transformation
work-flow is depicted in Fig. 2.




Fig. 2. Processing details of LWC documents with XSLT/FO. 



¹ It is possible to convert between different schema languages for XML with the
trang program [5]. There is also nxml-mode for GNU Emacs for editing XML,
which features highlighting, indentation, and on the fly validation against an RNC
schema [4].

² Modern browsers have XSLT engines built-in. So, it suffices to attach
appropriate stylesheets to a document to make the transformation on the fly.
With just a few customizations the translation from XML to XHTML
presents no obstacles (except for math formulas). On the other hand, the quality
of the PDF produced with the publicly available fop processor from the Apache
project is poor compared to that obtained with TeX.

Instead of generating FO objects one can use XSLT to translate XML
directly to high-level LaTeX. That is the method used in db2latex [3] (see also a
clone project: dblatex/dbcontext [9]; the latter, of course, generates files
processable by ConTeXt). The output can be customized at the XSLT stylesheet
level as well as by redefining appropriate LaTeX style files. MathML markup is
translated with XSLT to LaTeX and supported natively³.

The translation from DocBook to LaTeX implemented in these tools is
incomplete. To get reasonable results, prior customization to local needs is
required. The main advantage of this approach is that we use TeX - a robust
and well known application.



5 The GUST Bulletin Archive 

When considering the conversion of the GUST archive to XML we have two
points in mind: first, we recognize the long-term benefits of an electronic archive
of uniformly and generically marked-up documents; and second, we want to take
the opportunity to test the whole framework using ‘real’ data.

During the last 10 years, 20 volumes of the GUST bulletin were published,
containing more than 200 papers. From the very beginning GUST was tagged in
a modified TUGBOAT style [2]. The total output is not particularly impressive,
but the conversion of all articles to XML isn’t a simple one-night job for a bored
TeX hacker:

— they were produced over an entire decade and were written by over 100
different authors.

— they were processed with at least six major formats (LaTeX 2.09,
transitional LaTeX+NFSS, LaTeX2e, plain-based TUGBOAT, Eplain, and finally
ConTeXt), using numerous macro packages, fonts, and graphic formats⁴.

As a group, the GUST authors are not amateurs producing naive code.
On the contrary, they are experts, writing on a diverse range of subjects
using non-standard fonts, packages and macros. For example, Fig. 3 shows the
detailed distribution of the TeX formats used in GUST.

In total, there were 134 plain TeX articles, compared to 87 for LaTeX. LaTeX
authors used 74 different packages, while those preferring plain TeXnology used
139 different style files. The proportion of other formats (Eplain, ConTeXt,
BLUE) was insignificant (only a few submissions). It can also be noted from
Fig. 3 that in recent years, the proportion of plain TeX submissions has decreased
substantially in favor of LaTeX.

Fig. 3. Distribution of TeX formats used by GUST authors.

³ One approach which we did not try is to format FO files with TeX. This method is
implemented by S. Rahtz’ PassiveTeX [17].

⁴ Needless to say, all of these packages have been evolving during the last 10 years,
many of them becoming incompatible with each other.

It is obviously very difficult to maintain a repository containing papers re- 
quiring such a diverse range of external resources (macros, fonts). As a result, 
many old papers can no longer be typeset due to changes in underlying macros 
or fonts. 

6 Conversion from TeX to XML

It may be surprising that only a few papers report successful conversion from
TeX to XML: Grim [8] describes successful large-scale conversion in a large
academic institution, while Rahtz [16] and Key [12] describe translation to
SGML at Elsevier.

Basically, when converting TeX to XML the following three approaches have
been adopted [16]:

— Perl/Python hackery combined with manual retyping and/or manual XML
marking-up.

— Parsing source files not with tex, but with another program which
generates SGML/XML. This is the approach used by ltx2x [23], tralics [8]
and latex2html⁵, which replace LaTeX commands in a document by
user-defined strings.

— Processing files with TeX and post-processing the DVI files to produce
XML. This is the way tex4ht works.

Although the conversion performed with tralics is impressive, we found the
application very poorly documented. After evaluating the available tools and
consulting the literature [7], we decided to use TeX4ht - a TeX-based authoring
system for producing hypertext [10].

⁵ latex2html was not considered as its output is limited to HTML.

Because TeX formats contain many visually-oriented tags, we could not
expect to automatically convert them to content-oriented XML markup⁶.

For example, the TUGBOAT format requires only the metadata elements 
title and author name(s); author address(es) and webpage(s) of the author(s) are 
often absent and there is no obligation for abstracts and keywords. Therefore, 
most of the GUST articles lack these valuable elements. Moreover, bibliographies 
are inconsistently encoded⁷.

Having said that, our plan is to mark up as many elements as possible.



7 Translating TeX to XML with TeX4ht

Out of the box, the TeX4ht system is configured to translate from plain, LaTeX,
TUGBOAT (ltugboat, ltugproc), and Lecture Notes in Computer Science
(llncs) formats to HTML, XHTML, DocBook, or TEI. To translate from, say,
TUGBOAT to our custom XML format the system needs to be manually
configured. Because the configuration of TeX4ht from scratch is a non-trivial
task, we consider other more efficient possibilities.

The TeX4ht system consists of three parts: (1) Style files which enhance
existing macros with features for the output format (HTML, XHTML, etc.)⁸.
(2) The tex4ht processor which extracts HTML or XHTML, or DocBook, or
TEI files from DVI files produced by tex. (3) The t4ht processor which is
responsible for translating DVI code fragments which need to be converted to
pictures; for this task the processor uses tools available on the current platform.

As mentioned above, the conversion from a visual format to an
information-oriented one cannot be done automatically. Let’s illustrate this
statement with the following example marked with plain TeX macros⁹:

\noindent {\bf exercise, left as an} {\it adj\/} {\ss Tech} Used
to complete a proof when one doesn’t mind a {\bf handwave}, or
to avoid one entirely. The complete phrase is: {\it The proof
\rm(or \it the rest\/\rm) \it is left as an exercise for the
reader. \/} This comment has occasionally been attached to
unsolved research problems by authors possessed of either an
evil sense of humor or a vast faith in the capabilities of
their audiences. \hangindent=1em



⁶ For example, see [16,8]. Other examples, based on GUST articles, are presented
below.

⁷ Publicly available tools (see [13] for example) can automatically mark up manually
keyed bibliographies.

⁸ Altogether over 2.5M lines of TeX code. Compare this with 1M of code of the LaTeX
base macros.

⁹ The text comes from “The Project Gutenberg Etext of The Jargon File”, Version
4.0.0.
After translation of this fragment to XHTML by tex4ht we obtain:

<p class="noindent"><span class="cmbx-10">exercise, left as an
</span><span class="cmti-10">adj </span><span
class="cmss-10">Tech </span>Used to complete a proof when one
doesn&#x2019;t mind a <span class="cmbx-10">handwave</span>,
or to avoid one entirely. The complete phrase is: <span
class="cmti-10">The proof </span>(or <span class="cmti-10">the
rest</span>) <span class="cmti-10">is left as an exercise for
the reader. </span>This comment has occasionally been attached
to unsolved research problems by authors possessed of either
an evil sense of humor or a vast faith in the capabilities
of their audiences. </p>

and this could be rendered by a browser as: 

exercise, left as an adj Tech Used to complete a proof when one doesn't mind
a handwave, or to avoid one entirely. The complete phrase is: The proof (or the
rest) is left as an exercise for the reader. This comment has occasionally been
attached to unsolved research problems by authors possessed of either an evil
sense of humor or a vast faith in the capabilities of their audiences.



We can see that tex4ht uses ‘span’ elements to mark up font changes. These
visual tags could be easily remapped to logical ones, unless fragments of text
with different meanings are marked with the same tag. Here the tag cmti-10
was used to tag both the short form ‘adj’ and the example phrase (shown in the
green italic font). To tag them differently we need different TeX macros specially
configured for TeX4ht. Note that the \hangindent=1em was ignored by tex4ht.
This command could not be fully simulated, because TeX’s hanging indentation
is not supported by browsers in full generality.

So, the markup produced by the tex4ht program is not logical markup. To
get logical markup the GUST format should be reworked and reconfigured for
TeX4ht.

Instead of configuring TeX4ht we could use an XSLT stylesheet to remap
elements to the reference XML format. This could be an easier route than
configuring the system from scratch, while some TeX4ht configuration could also
help. So, a combination of the two methods is envisaged to provide the best
results.
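
The remapping of visual tags to logical ones, and the reason it breaks down
for cmti-10, can be sketched in a few lines. The paper envisages an XSLT
stylesheet for this step; Python's standard xml.etree.ElementTree stands in
here only for illustration. The logical element names (term, domain) are
hypothetical and do not come from the GUST format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from tex4ht font classes to logical element names.
# cmti-10 is deliberately absent: in the sample it marks both the short
# form "adj" and the example phrase, so no single logical name fits.
CLASS_TO_LOGICAL = {
    "cmbx-10": "term",
    "cmss-10": "domain",
}


def remap(xhtml_fragment):
    """Rename mappable <span> elements; return (xml, unmapped_classes)."""
    root = ET.fromstring(xhtml_fragment)
    unmapped = set()
    for span in list(root.iter("span")):
        cls = span.get("class", "")
        logical = CLASS_TO_LOGICAL.get(cls)
        if logical is None:
            unmapped.add(cls)  # left as a visual span for manual review
            continue
        span.tag = logical
        span.attrib.pop("class", None)
    return ET.tostring(root, encoding="unicode"), sorted(unmapped)


FRAGMENT = (
    '<p class="noindent"><span class="cmbx-10">exercise, left as an </span>'
    '<span class="cmti-10">adj </span><span class="cmss-10">Tech </span>'
    'Used to complete a proof. <span class="cmti-10">The proof </span></p>'
)
```

Calling remap(FRAGMENT) renames the bold and sans-serif spans and reports
cmti-10 as unmapped. Those are exactly the fragments that need distinct
TeX macros, or manual markup, before a logical name can be assigned.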



8 Conclusion and Future Work 

We have not completed the conversion yet. However, based on the experience
gained so far we can estimate that almost 70% of the whole archive should be
converted with little manual intervention. Semi-automatic conversion of another
15% (34 papers) is possible, with prior extensive changes in markup. Conversion
of the remaining 15% is impossible or useless, where ‘impossible’ means the paper
is easier to retype than to recompile and adjust tex4ht just for a particular
single case, and ‘useless’ applies to papers demonstrating complicated graphical
layouts, or advanced typesetting capabilities of TeX.

Although our system needs improvement - conversion of math is the most 
important remaining item to investigate - we are determined to start to use it 
in a production environment. 

Finally, we note that many of our conclusions and methods are also applicable 
to TUGBOAT, because the format used for typesetting the GUST bulletin differs
only slightly from the one used for TUGBOAT. 
Abstract. This paper presents a new approach for creating animations
in Portable Document Format (PDF). The method of animation
authoring described uses free software (pdfTeX) only. The animations are
viewable by any viewer that supports at least some features of Acrobat 
JavaScript, particularly Adobe (Acrobat) Reader, which is available at 
no cost for a wide variety of platforms. Furthermore, the capabilities 
of PDF make it possible to have a single file with animations both for 
interactive viewing and printing. 

The paper explains the principles of PDF, Acrobat JavaScript and
pdfTeX needed to create animations for Adobe Reader using no other
software except pdfTeX. We present a step by step explanation of
animation preparation, together with sample code, using a literate pro- 
gramming style. Finally, we discuss other possibilities of embedding 
animations into documents using open standards (SVG) and free tools, 
and conclude with their strengths and weaknesses with respect to the 
method presented. 



1 Introduction 

Extensive use of electronic documents leads to new demands being made on 
their content. Developing specific document versions for different output devices 
is time consuming and costly. A very natural demand, especially when preparing 
educational materials, is embedding animations into a document. 

A widely used open format for electronic documents is the Adobe PDF [2] for- 
mat, which combines good typographic support with many interactive features. 
Even though it contains no programming language constructs such as those found 
in PostScript, the format allows for the inclusion of Document Level JavaScript 
(DLJS) [1]. Widely available PDF viewers such as Adobe Reader (formerly 
Acrobat Reader) benefit from this possibility, allowing interactive documents to 
be created. 

One of the first applications showing the power of using JavaScript with PDF
was Hans Hagen’s calculator [5]. Further, the AcroTeX bundle [9] uses several
LaTeX packages and the full version of the Adobe Acrobat software for preparing
PDF files with DLJS [10]; macro support for animations is rudimentary and
it is stressed in the documentation that it works only with the full commercial
version of Acrobat.

Our motivation is a need for PDF animations in a textbook [3] published 
both on paper and on CD. We have published it using Acrobat [7,8], and 
eventually discovered a method to create animations using pdfTeX [11] only.

pdfTeX facilitates the PDF creation process in several ways. We can directly
write the PDF code which is actually required to insert an animation. We can
also utilise the TeX macro expansion power to produce PDF code. And finally,
we can write only the essential parts directly, leaving the rest to pdfTeX. pdfTeX
introduces new primitives to take advantage of PDF features. The ones we are
going to use will be described briefly as they appear.

In this paper, we present this new ‘pdfTeX only’ way of embedding
animations. We require no previous knowledge either of the PDF language or of
pdfTeX extensions to TeX. However, the basics of TeX macro definitions and
JavaScript are assumed.

The structure of the paper is as follows. In the next section we start with the
description of the PDF internal document structure with respect to animations.
The core of the paper consists of commented pdfTeX code that generates
a simple all-in-one animation. The examples are written in plain TeX [6], in a
literate programming style, so that others can use them in elaborate macro
packages. In the second example the animation is taken from an external file,
allowing the modification of the animation without modifying the primary
document. Finally, we compare this approach with the possibilities of other
formats, including the new standard for Scalable Vector Graphics (SVG) [12]
from the W3C.

2 The PDF Document Structure 

A PDF file typically consists of a header, a body, a cross-reference table and 
a trailer. The body is the main part of the PDF document. The other parts 
provide meta-information and will not be discussed here. A PDF document is 
actually a graph of interconnected objects, each being of a certain type. There 
are basic data types (boolean, numeric, string) and some special and compound 
types which require some explanation. 

A name object has the form /MYNAME. There is a set of names with 
predefined meanings when used as a dictionary key or value. Other names can be 
defined by the user as human readable references to indirect objects (dictionaries 
and indirect objects are treated below). An array object is a one-dimensional 
list, enclosed by square brackets, of objects not necessarily of the same type. A 
dictionary object is a hash, i.e., a set of key-value pairs where the keys are name
objects and the values are arbitrary objects. A dictionary is enclosed by the << 
and >> delimiters. Stream objects are used to insert binary data into a PDF 
document. There is also a special null object used as an “undefined” value. 

The body of a PDF file consists of a sequence of labelled objects called
indirect objects. An object of any other type which is given a unique object
identifier can form an indirect object. When an object is required in some place
(an array element, a value of a key in a dictionary), it can be given explicitly
(a direct reference) or as an object identifier to an indirect object (an indirect
reference). In this way objects are interconnected to form a graph. An indirect
reference consists of two numbers. The first number is a unique object number.
The second is an object version number and is always 0 in indirect objects
newly created by pdfTeX - the first one therefore suffices to restore an indirect
reference.
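
The object types described above can be made concrete with a small
serializer. This is a sketch under simplifying assumptions (no string
escaping, no stream objects), written in Python only for illustration; the
class and function names (Name, Ref, serialize, indirect) are invented for
this example and belong to no PDF library.

```python
class Name(str):
    """A PDF name object, written as /Name."""


class Ref:
    """An indirect reference: an object number plus a version number."""

    def __init__(self, number, generation=0):
        self.number = number
        self.generation = generation


def serialize(obj):
    """Render a Python value in PDF syntax (simplified)."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # test bool before int: bool subclasses int
        return "true" if obj else "false"
    if isinstance(obj, Ref):
        return f"{obj.number} {obj.generation} R"
    if isinstance(obj, Name):  # test Name before str: Name subclasses str
        return "/" + obj
    if isinstance(obj, (int, float)):
        return str(obj)
    if isinstance(obj, str):
        return "(" + obj + ")"
    if isinstance(obj, list):
        return "[ " + " ".join(serialize(x) for x in obj) + " ]"
    if isinstance(obj, dict):
        inner = " ".join(
            f"{serialize(Name(key))} {serialize(value)}"
            for key, value in obj.items()
        )
        return "<< " + inner + " >>"
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def indirect(number, obj):
    """Wrap an object as an indirect object with version number 0."""
    return f"{number} 0 obj\n{serialize(obj)}\nendobj"
```

For instance, serialize({"Font": {"Helv": Ref(12)}}) yields a dictionary
whose value is another dictionary holding the indirect reference 12 0 R;
the object number 12 is an arbitrary value chosen for the example.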

Various document elements are typically represented by dictionary objects. 
Each element has a given set of required and optional keys for its dictionary. 
For example, the document itself is represented by a Catalog dictionary, the 
root node of the graph. Its key-value pairs define the overall properties of the 
document. A brief description of concrete objects will be given when encountered 
for the first time. See [2] for more detailed information. 

3 Insertion of the Animation Frames 

We are not interested in constructing the animation frames themselves - any 
graphics program such as METAPOST will do. Let us hence assume we have a
PDF file, each page of which forms a single animation frame and the frames are 
in the order of appearance. 

Every image is inserted into PDF as a so-called form XObject which is
actually an indirect stream object. There are three primitives that deal with
images in pdfTeX. The \pdfximage primitive creates an indirect object for a
given image. The image can be specified as a page of another PDF file. However,
the indirect object is actually inserted only if referred to by the \pdfrefximage
primitive or preceded by \immediate. \pdfrefximage takes an object number (the
first number of an indirect reference) as its argument and adds the image to the
TeX list being currently built. The object number of the image most recently
inserted by \pdfximage is stored in the \pdflastximage register.

A general PDF indirect object can be created similarly by \pdfobj,
\pdfrefobj and \pdflastobj. \pdfobj takes the object content as its argument.
TeX macro expansion can be used for generating PDF code in an ordinary
manner.

In our example, we first define four macros for efficiency. The \ximage macro
creates a form XObject for a given animation frame (as an image) and saves its
object number under a given key. The \insertobj macro creates a general PDF
object and saves its object number under a given key. The \oref macro expands
to an indirect reference of an object given by the argument. The last “R” is an
operator that creates the actual indirect reference from two numbers. We are not
going to use \pdfref* primitives, so \immediate must be present. References
will be put directly into the PDF code by the \oref macro. The \image macro
actually places an image given by its key onto the page.
 1 % an image for further use
 2 \def\ximage#1#2{%
 3   \immediate\pdfximage
 4     page #2 {frames-in.pdf}%
 5   \expandafter\edef
 6     \csname pdf:#1\endcsname
 7     {\the\pdflastximage}}
 8
 9 % a general object for further use
10 \def\insertobj#1#2{%
11   \immediate\pdfobj{#2}%
12   \expandafter\edef
13     \csname pdf:#1\endcsname
14     {\the\pdflastobj}}
15
16 % expands to an indirect ref. for a key
17 \def\oref#1{%
18   \csname pdf:#1\endcsname\space 0 R}
19
20 % actually places an image
21 \def\image#1{%
22   \expandafter\pdfrefximage
23     \csname pdf:#1\endcsname}

Another new primitive introduced by pdfTeX is \pdfcatalog. Its argument
is added to the document’s Catalog dictionary every time it is expanded. The
one below makes the document open at the first page and the viewer fit the page
into the window. One more key will be described below.

24 % set up the document
25 \pdfcatalog{/OpenAction [ 0 /Fit ]}

Now we are going to insert animation frames into the document. We will use
the \ximage macro defined above. Its first argument is the name to be bound
with the resulting form XObject. The second one is the number of the frame
(actually a page number in the PDF file with frames). One needs to be careful
here because pdfTeX has one-based page numbering while PDF uses zero-based
page numbering internally.

26 % all animation frames are inserted
27 \ximage{fr0}{1} \ximage{fr1}{2}
28 \ximage{fr2}{3} \ximage{fr3}{4}
29 \ximage{fr4}{5} \ximage{fr5}{6}
30 \ximage{fr6}{7} \ximage{fr7}{8}
31 \ximage{fr8}{9}
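
The off-by-one relation between the zero-based frame names and the one-based
page numbers is easy to get wrong when the frame count changes. As a hedged
illustration only, the following Python sketch generates the \ximage calls
for an arbitrary number of frames. The helper name ximage_calls is
hypothetical, and the paper itself writes the calls by hand.

```python
def ximage_calls(frame_count, per_line=2):
    r"""Return TeX lines registering frames fr0..fr(N-1) from pages 1..N."""
    calls = [
        # frame names are zero-based, pdfTeX page numbers are one-based
        r"\ximage{fr%d}{%d}" % (index, index + 1)
        for index in range(frame_count)
    ]
    return [
        " ".join(calls[start:start + per_line])
        for start in range(0, len(calls), per_line)
    ]
```

With nine frames, ximage_calls(9) reproduces the five lines of \ximage calls
in the listing above, without the listing line numbers and comment.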




Animations in pdfTeX-Generated PDF



4 Setting Up an AcroForm Dictionary 

The interactive features are realized by annotation elements in PDF. These
form a separate layer in addition to the regular document content. Each one
denotes an area on the page to be interactive and binds some actions to various
events that can happen for that area. Annotations are represented by Annot
dictionaries. The way pdfTeX inserts annotations into PDF is discussed in the
section “Animation Dynamics” below.

Annotations are transparent by default, i.e., the page appearance is left 
unchanged when adding an annotation. It is up to the regular content to provide 
the user with the information that some areas are interactive. 

We will be interested in a subtype of annotations called interactive form 
fields. They are represented by a Widget subtype of the Annot dictionary. Widgets
can be rendered on top of the regular content. However, some resources have to 
be set. The document’s Catalog refers to an AcroForm dictionary in which this 
can be accomplished. 

The next part of the example first defines the name Helv to represent the 
Helvetica base font (a built-in font). This is not necessary but it allows us to
have a smooth control button. Next we insert the AcroForm dictionary. The 
DR stands for “resource dictionary”. We only define the Font resource with one 
font. The DA stands for “default appearance” string. The /Helv sets the font, 
the 7 Tf sets the font size scale factor to 7 and the 0 g sets the color to be 
0% white (i.e., black). The most important entry in the AcroForm dictionary 
is NeedAppearances. Setting it to true (line 43) makes the Widget annotations 
visible. Finally, we add the AcroForm dictionary to the document’s Catalog. 

32 % the Helvetica basefont object
33 \insertobj{Helv}{
34   << /Type /Font /Subtype /Type1
35      /Name /Helv
36      /BaseFont /Helvetica >> }
37
38 % the AcroForm dictionary
39 \insertobj{AcroForm}{
40   << /DR << /Font <<
41      /Helv \oref{Helv} >> >>
42      /DA (/Helv 7 Tf 0 g )
43      /NeedAppearances true >> }
44
45 % add a reference to the Catalog
46 \pdfcatalog{/AcroForm \oref{AcroForm}}

To make a form XObject with an animation frame accessible to JavaScript, 
it has to be assigned a name. There are several namespaces in PDF in which 
this can be accomplished. The one searched for is determined from context. 
We are only interested in an AP namespace that maps names to annotation 
appearance streams. pdfTeX provides the \pdfnames primitive that behaves
similarly to \pdfcatalog. Each time it is expanded it adds its argument to
the Names dictionary referred to from the document’s Catalog. The Names dictionary
contains the name definitions for various namespaces. In our example we put 
definitions into a separate object AppearanceNames. 

The name definitions may form a tree to make the lookup faster. Each node 
has to have Limits set to the lexically least and greatest names in its subtree. 
There is no extensive set of names in our example, so one node suffices. The 
names are defined in the array of pairs containing the name string and the 
indirect reference. 

47 % defining names for frames
48 \insertobj{AppearanceNames}{
49 << /Names
50 [ (fr0) \oref{fr0} (fr1) \oref{fr1}
51 (fr2) \oref{fr2} (fr3) \oref{fr3}
52 (fr4) \oref{fr4} (fr5) \oref{fr5}
53 (fr6) \oref{fr6} (fr7) \oref{fr7}
54 (fr8) \oref{fr8} ]
55 /Limits [ (fr0) (fr8) ] >> }
56
57 % edit the Names dictionary
58 \pdfnames{/AP \oref{AppearanceNames}}
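
The single-node name tree above can be cross-checked with a short sketch. The Python below is not part of the original example: `name_tree_leaf` is a hypothetical helper, and the object numbers are invented for illustration. It sorts the names and derives /Limits from the first and last one, which is the rule stated in the text.

```python
def name_tree_leaf(names_to_refs):
    """Build the text of a one-node PDF name tree.

    names_to_refs maps a name string to an indirect reference such as
    "12 0 R". The keys are emitted in sorted order and /Limits holds the
    least and greatest key of the node.
    """
    if not names_to_refs:
        raise ValueError("a leaf node needs at least one name")
    keys = sorted(names_to_refs)  # code-point order, adequate for ASCII names
    pairs = " ".join(f"({k}) {names_to_refs[k]}" for k in keys)
    return f"<< /Names [ {pairs} ] /Limits [ ({keys[0]}) ({keys[-1]}) ] >>"


# Hypothetical object numbers for the nine frames fr0..fr8.
frames = {f"fr{i}": f"{10 + i} 0 R" for i in range(9)}
leaf = name_tree_leaf(frames)
```

Note that the ordering is lexical, so with ten or more frames a name such as fr10 would sort before fr2; zero-padded names (fr02, fr10) would avoid surprises in that case.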

5 Animation Dynamics 

We have created all the data structures needed for the animation in the previous 
section. Here we introduce the code to play the animation. It uses Acrobat 
JavaScript [1], an essential element of interactive forms. Acrobat JavaScript is
an extension of Netscape JavaScript targeted to PDF and Adobe Acrobat. Most 
of its features are supported by Adobe Reader. They can, however, be supported 
by any other viewer. Nevertheless, the Reader is the only one known to us that 
supports interactive forms and JavaScript. 

The animation is based on interchanging frames in a single widget. Here we 
define the number of frames and the interchange timespan in milliseconds to 
demonstrate macro expansion in JavaScript. 

59 % animation properties
60 \def\frames{8}
61 \def\timespan{550}

Every document has its own instance of a JavaScript interpreter in the 
Reader. Every JavaScript action is interpreted within this interpreter. This 
means that one action can set a variable to be used by another action triggered 
later. Document-level JavaScript code, e.g., function definitions and global vari- 
able declarations, can be placed into a JavaScript namespace. This code should 
be executed when opening the document. 

Unfortunately, there is a bug in the Linux port of the Reader that renders 
this generally unusable. The document-level JavaScript is not executed if the
Reader is not running yet and the document is opened from a command line
(e.g., ‘acroread file.pdf’). Neither the first page’s nor the document’s open
action is executed, which means they cannot be used as a workaround. Binding
JavaScript code to another page’s open action works well enough to suffice in
most cases.

We redeclare everything each time an action is triggered so as to make the 
code as robust as possible. First we define the Next function, which takes a 
frame index from a global variable, increases it modulo the number of frames 
and shows the frame with the resulting index. The global variable is modified. 

The animation actually starts at line 78 where the frame index is initial- 
ized. The frames are displayed on an interactive form’s widget that we name 
"animation" - see “Placing the Animation” below. A reference to this widget’s 
object is obtained at line 79. Finally, line 80 says that from now on, the Next 
function should be called every \timespan milliseconds. 

62 % play the animation
63 \insertobj{actionPlay}{
64 << /S /JavaScript /JS (
65 function Next() {
66 g.delay = true;
67 if (cntr == \frames) {
68 cntr = 0;
69 try { app.clearInterval(arun); }
70 catch(except) {}
71 } else { cntr++; }
72 g.buttonSetIcon(
73 this.getIcon("fr" + cntr));
74 g.delay = false;
75 }
76 try { app.clearInterval(arun); }
77 catch(except) {}
78 var cntr = 0;
79 var g = this.getField("animation");
80 var arun = app.setInterval("Next()",
81 \timespan);
82 ) >> }

Now, let us describe the Next function in more detail. Line 66 suspends
the widget’s redrawing until line 74. Then the global variable containing the current
frame index is tested. If the index has reached the number of frames, it is set back
to zero and the periodic calling of the function is interrupted. The function
would be aborted on error, but because we catch exceptions this is avoided. The
getIcon function takes a name as its argument and returns the reference to the
appearance stream object according to the AP names dictionary. This explains
our approach of binding the names to animation frames - here we use the names
for retrieving them. The buttonSetIcon method sets the object’s appearance
to the given icon.

Line 76 uses the same construct as line 69 to handle situations in which
the action is relaunched while the animation is still running. It aborts the
previous action. Clearing the interval would be an error if the animation were not
running, hence we must use the exception catching approach.
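
The counter logic of Next can be replayed outside the Reader to see which frames are shown and when the timer stops. The following Python sketch is only a model of lines 65-81 under the assumption that every timer tick calls Next exactly once; `simulate_next` is a hypothetical name and nothing here touches the Acrobat API.

```python
def simulate_next(frames, ticks):
    """Replay the counter logic of Next() for a number of timer ticks.

    Returns the list of frame indices shown, stopping early once the
    interval would have been cleared.
    """
    cntr = 0              # line 78: start from the basic frame
    shown = []
    for _ in range(ticks):
        if cntr == frames:
            cntr = 0      # lines 68-70: wrap around and stop the timer
            shown.append(cntr)
            break
        cntr += 1         # line 71
        shown.append(cntr)
    return shown


sequence = simulate_next(8, 20)
```

With \frames set to 8 the model shows frames 1 to 8 and then returns to frame 0, after which no further ticks are processed.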



6 Placing the Animation 

The animation is placed on an interactive form field - a special type of annotation.
There are two primitives in pdfTeX, \pdfstartlink and \pdfendlink,
to produce annotations. They are intended to insert hyperlink annotations
but can be used for creating other annotations as well. The corresponding
\pdfstartlink and \pdfendlink must reside at the same box nesting level.
The resulting annotation is given the dimensions of the box that is enclosed by
the primitives. We first create a box to contain the annotation. Note that both
box and annotation size are determined by the frame itself - see line 90 where
the basic frame is placed into the regular page content.

We will turn now to the respective entries in the annotation dictionary. The 
annotation is to be an interactive form field (/Subtype /Widget). There are 
many field types (FT). The only one that can take any appearance and change 
it is the pushbutton. It is a special kind of button field type (/FT /Btn). The 
type of button is given in an array of field bit flags Ff. The pushbutton has to 
have bit flag 17 set (/Ff 65536). To be able to address the field from JavaScript 
it has to be assigned a name. We have assigned the name animation to it as 
mentioned above (/T (animation)). Finally, we define the appearance charac- 
teristics dictionary MK. The only entry /TP 1 sets the button’s appearance to 
consist only of an icon and no caption. 

83 % an animation widget
84 \centerline{\hbox{%
85 \pdfstartlink user{
86 /Subtype /Widget /FT /Btn
87 /Ff 65536 /T (animation)
88 /BS << /W 0 >>
89 /MK << /TP 1 >> }%
90 \image{fr0}%
91 \pdfendlink}}
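
The /Ff value in the listing follows from the bit position quoted in the text. As a small arithmetic check, the Python sketch below converts a 1-based bit position into the integer written in the dictionary; `field_flag` is a hypothetical helper introduced only for this illustration.

```python
def field_flag(bit_position):
    """Return the integer value of a field flag given its 1-based bit position."""
    if bit_position < 1:
        raise ValueError("bit positions are numbered from 1")
    return 1 << (bit_position - 1)


PUSHBUTTON = field_flag(17)  # the pushbutton flag used in /Ff
```

Several flags would be combined with a bitwise OR before being written as a single /Ff integer.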

For the sake of brevity and clarity we are going to introduce only one control 
button in our example. However, we have defined a macro for creating control 
buttons to show a very simple way of including multiple control buttons. The 
\controlbutton macro takes one argument: the caption of the button it is to 
produce. The macro creates a pushbutton and binds it to an action defined like 
actionPlay. 

We have chosen control buttons to be pushbuttons again. They differ a
little from the animation widget - they are supposed to look like buttons.
The BS dictionary (i.e., border style) sets the border width to 1 point and style
to 3D button look. The MK dictionary (appearance characteristics dictionary)
sets the background color to 60% white and the caption (line 98). The /H /P
entry tells the button to push down when clicked on. Finally, an action is bound
to the button by setting the value of the A key.

92 % control button for a given action
93 \def\controlbutton#1{%
94 \hbox to 1cm{\pdfstartlink user{
95 /Subtype /Widget /FT /Btn
96 /Ff 65536 /T (Button#1)
97 /BS << /W 1 /S /B >>
98 /MK << /BG [0.6] /CA (#1) >>
99 /H /P /A \oref{action#1}
100 }\hfil\strut\pdfendlink}}

And finally, we add a control button that plays the animation just below the 
animation widget. 

101 % control button
102 \centerline{\hfil
103 \controlbutton{Play}\hfil}
104
105 \bye

7 External Animation 

Let us modify the example a little so that the animation frames will be taken 
from an external file. This has several consequences which will be discussed at 
the relevant points in the code. 

We are going to completely detach the animation frames from the document. 
As a result, we will need only the \insertobj and \oref macros from lines 1-23
of the previous example. Lines 26-31 are no longer required.

A problem arises here: the basic frame should be displayed in the animation 
widget when the document is opened for the first time. This can be accomplished 
by modifying the OpenAction dictionary at line 25 as follows. 

\pdfcatalog{ /OpenAction <<
/S /JavaScript /JS (
var g = this.getField("animation");
g.buttonImportIcon(
"frames-ex.pdf",0);
this.pageNum = 0;
this.zoomType = zoomtype.fitP;
) >> }

This solution suffers from the bug mentioned in the “Animation Dynamics” 
section. The animation widget will be empty until a user performs an action 
every time the bug comes into play.

We still need an AcroForm dictionary, so lines 32-46 are left unchanged.
Lines 47-58, on the other hand, must be omitted, as we have nothing to
name. We are going to use the same animation as in the previous example, so
lines 59-61 are left untouched. There is one modification of the JavaScript code
to be done. The buttonSetIcon function call is to be replaced by

g.buttonImportIcon(
"frames-ex.pdf", cntr);

We have used the basic frame to determine the size of the widget in the previous
example. This is impossible now because the size has to be known at compile time. The
replacement for lines 83-91 is as follows:

% an animation widget
\centerline{\hbox to 6cm{%
\vrule height 6cm depth 0pt width 0pt
\pdfstartlink user{
/Subtype /Widget /FT /Btn
/Ff 65536 /T (animation)
/BS << /W 0 >>
/MK << /TP 1
/IF << /SW /A /S /P
/A [0.5 0.5] >> >> }%
\hfil\pdfendlink}}

Dimensions of the widget are specified explicitly and an IF (icon fit) dictio- 
nary is added to attributes of the pushbutton so that the frames would be always 
(/SW /A) proportionally (/S /P) scaled to fit the widget. Moreover, frames are to 
be centered in the widget (/A [0.5 0.5]) which would be the default behavior 
anyway. The basic frame is not placed into the document - there is only glue 
instead. 
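
The effect of the icon fit entries can be illustrated numerically. The Python sketch below is an assumption-laden model rather than the viewer's actual algorithm: `fit_icon` is a hypothetical function, the units are arbitrary, and it simply scales proportionally to the limiting dimension and distributes the leftover space according to the two alignment fractions.

```python
def fit_icon(icon_w, icon_h, box_w, box_h, align=(0.5, 0.5)):
    """Scale an icon proportionally into a box and position it.

    Returns (scale, x_offset, y_offset); offsets are measured from the
    lower-left corner of the box. align gives the fraction of leftover
    space placed to the left of and below the icon.
    """
    if min(icon_w, icon_h, box_w, box_h) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(box_w / icon_w, box_h / icon_h)  # /S /P: keep the aspect ratio
    x_off = (box_w - icon_w * scale) * align[0]
    y_off = (box_h - icon_h * scale) * align[1]
    return scale, x_off, y_off


# A frame twice as wide as it is high, placed into a square widget.
placement = fit_icon(200, 100, 100, 100)
```

In this example the frame is halved to fill the widget's width and is centered vertically, leaving equal margins above and below.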

Lines 92-105 need not be modified. 

8 Two Notes on Animation Frames 

The examples with full TeX source files can be found at
http://www.fi.muni.cz/~xholecek/animations/. As one can see in these examples, the all-in-one
approach allows all frames to share a single background which is formed by 
the frame actually inserted into the page. However, it is possible to overlay 
pushbuttons. Elaborate constructions, the simplest of which is to use a common 
background frame in the example with external animations, can be achieved in 
conjunction with transparency. 

One must ensure the proper size of all frames when fitting them into the
widget. We have encountered situations (the given example being one of them)
where the bounding box of a METAPOST-generated graphic with a label
was not set properly using \convertMPtoPDF and a white line had to be drawn
around the frames to force the proper bounding box as a workaround.

9 Animations in Other Formats

It is fair to list and compare other possible ways of creating animations. In 
this section we give a brief overview of several other formats and technologies
capable of handling animations. 

9.1 GIF 

One of the versions of the GIF format, GIF89a, supports multiple images,
which allows bitmap-only animations to be encoded within a single GIF
file. The GIF format supports transparency, interlacing and plain text blocks. It is
widely supported in Internet browsers. However, there are licensing problems
due to the compression methods used, and the format is not supported in freely
available TeXware.

9.2 SWF 

The SWF format by Macromedia allows storing frame-based animations, created 
e.g., by Macromedia’s Flash authoring tool. The SWF authoring tools have to 
compute all the animation frames at export time. As proprietary Flash plug- 
ins for a wide range of Internet browsers are available, animations in SWF are 
relatively portable. The power of SWF can be enriched with scripting by ActionScript.
At the time of writing, we are not aware of any TeXware supporting
SWF.

9.3 Java 

One can certainly program animations in a general programming language like 
Sun’s Java. The drawback is that there are high demands on one’s programming 
capabilities in Java when creating portable animations. With NTS (a TeX
reimplementation in Java), one can possibly combine TeX documents with fully
featured animations, at the expense of studying numerous available classes, 
interfaces and methods. 

9.4 DOM 

It is possible to reference every element in an HTML or XML document by 
means of the W3C’s Document Object Model (DOM), a standard API for 
document structure. 

DOM offers programmers the possibility of implementing animations with 
industry-standard languages such as Java, or scripting languages such as ECMAScript,
JavaScript or JScript.

9.5 SVG

The most promising language for powerful vector graphic animation description 
seems to be Scalable Vector Graphics (SVG), a W3C recommendation [12]. It 
is being developed for XML graphical applications, and since SVG version 1.1 
there is rich support for animations. The reader is invited to look at the freely
available book chapter [13] about SVG animations on the publisher’s Web site,
or to read [4] about the first steps of SVG integration into the TeX world. There
are freely available SVG viewers from Adobe (browser plug-in), Corel, and the 
Apache Foundation (Squiggle). 

SVG offers even smaller file sizes than SWF or our method. The description 
of animations is time-based, using another W3C standard, SMIL, Synchronised 
Multimedia Integration Language. The author can change only one object or its 
attribute in the scene at a time, allowing detailed control of animated objects 
through the declarative XML manner. Compared to our approach, this means 
a much wider range of possibilities for creators of animations. 

The SVG format is starting to be supported in TeXware. There are SVG
backends in VTeX and BaKoMa TeX, and a program Dvi2Svg by Adrian
Frischauf, available at http://www.activemath.org/~adrianf/dvi2svg/. Another
implementation of a DVI to SVG converter in C is currently being developed
by Rudolf Sabo at the Faculty of Informatics, Masaryk University in
Brno.



10 Conclusions 

We have shown a method of preparing both space-efficient and high-quality 
vector frame-based animations in PDF format using only freely available,
TeX-integrated tools.

1 Introduction

iTeXMac is an integrated suite of three major components: a text editor detailed
in section 2, a PDF viewer detailed in section 3, and a TeX front end detailed in
section 4. Some notes on installation are followed by remarks concerning
inter-application communication in section 6 for other Mac OS X developers. Finally,
the pdfsync feature and the TeX Wrapper are discussed in sections 7 and 8. Since
they concern the synchronization between the TeX source and PDF output, and
a definition for a shared TeX document structure, both will certainly interest
the whole TeX community.

2 The Text Editor 

iTeXMac can be used either with a built-in text editor or an external one.
All standard text editors like TextEdit, BBEdit, AlphaX, vi, and emacs are
supported and configuring iTeXMac for other editors is very easy, even when
coming from the X11 world.



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 192-202, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

The built-in text editor comes with flavours similar to emacs and AlphaX
modes. It relies on a plug-in architecture that allows very different kinds of user
interfaces according to the type of the file being edited. Whereas AlphaX uses
Tcl and emacs uses Lisp, iTeXMac utilizes the benefits of Objective-C bundles,
giving plug-ins great potential power together with the application.

Besides the standard features shared by advanced text editors (like key
binding management, advanced regular expressions, command completion), an
interesting feature of iTeXMac’s text editor is the syntax parsing policy. The
syntax highlighting depends heavily on the kind of text edited, whether it is Plain,
LaTeX or METAPOST (support for HTML is planned). The text properties used
for highlighting include not only the color of the text, but also the font, the
background color and some formatting properties.

Moreover, the command shortcuts that refer to mathematical and text symbols
are replaced by the glyph they represent, thus replacing \alpha with α and
so on. Conversely, the built-in editor can show a character palette with 42 menus
gathering text and mathematical symbols, as they would appear in the output.
The editor thus serves as a graphical front-end to the standard LaTeX packages
amsfonts.sty, amssymb.sty, mathbb.sty, mathrsfs.sty, marvosym.sty and
wasysym.sty, which makes thousands of symbols available with just one click.
The result is a text editor that is much more WYSIWYG than others,
with no source file format requirement.

There is also advanced management of string encodings, and iTeXMac supports
more than 80 of them with an efficient user interface. The text files are
scanned for hints about the text encoding:



LaTeX     \usepackage[encoding]{inputenc}
ConTeXt   \enableregime[encoding]
emacs     %-*-coding: character encoding;
          %!iTeXMac(charset): character encoding
          Mac OS X hidden internals



But this is not a user-friendly practice, and it will be improved by the
TeX wrappers discussed in section 8.
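
How such hints might be detected can be sketched in a few lines. The Python below is not iTeXMac's implementation: the regular expressions, the tolerated spacing, and the name `find_encoding_hint` are assumptions made for illustration, covering only the four textual hints listed above.

```python
import re

# Patterns for the textual hints listed above; the exact spacing that
# iTeXMac tolerates is an assumption of this sketch.
HINT_PATTERNS = [
    ("latex", re.compile(r"\\usepackage\s*\[([^\]]+)\]\s*\{inputenc\}")),
    ("context", re.compile(r"\\enableregime\s*\[([^\]]+)\]")),
    ("emacs", re.compile(r"%\s*-\*-\s*coding\s*:\s*([^;\s]+)")),
    ("itexmac", re.compile(r"%\s*!\s*iTeXMac\s*\(charset\)\s*:\s*(\S+)")),
]


def find_encoding_hint(text):
    """Return (kind, encoding) for the first hint found, or None."""
    for line in text.splitlines():
        for kind, pattern in HINT_PATTERNS:
            match = pattern.search(line)
            if match:
                return kind, match.group(1).strip()
    return None
```

A front end would typically run such a scan on the first few lines of a file before decoding the rest, falling back to a default encoding when no hint is present.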

Spell checking for TeX input is available with Rick Zaccone’s LaTeX-aware
Excalibur¹ and Anton Leuski’s TeX-aware cocoAspell², a port of the Free and
Open Source spell checker aspell. The latter also knows about HTML code and
is integrated into Mac OS X, allowing iTeXMac to check spelling as you type, with
misspelled words being underlined in red. While this service is provided to all
applications for free, iTeXMac is the only one that truly enables TeX support by
managing the language and the list of known words on a file by file basis using
TeX Wrappers.



¹ http://www.eg.bucknell.edu/~excalibr/
² http://cocoAspell.leuski.net/

3 The PDF Viewer

iTeXMac can be used either with a built-in PDF viewer or an external one.
The built-in viewer lacks many advanced features of Acrobat or Preview, but it 
updates the display automatically when a PDF file has been changed externally. 
Moreover, it allows you to make selections and export them to other applications, 
like Word, TextEdit or Keynote for example. Finally, it has support for the useful 
PDF synchronization discussed below and is very well integrated in the suite. 

iTeXMac can open PS, EPS and DVI files with a double click, by first
converting them to PDF. It thus plays the role of a PostScript or a DVI 
viewer. This feature is now partially obsolete since Mac OS X version 10.3 
provides its own PS to PDF translator used by the Preview application shipped 
with the system. 



4 The TeX Front End

This component of the software serves two different purposes. On one hand it 
is a bridge between the user and the utilities of a standard TeX distribution: a
graphical user interface for the commands tex, latex, pdftex, and so on. On 
the other hand, it has to properly manage the different kinds of documents one 
wants to typeset. 

Actually, the iTeXMac interface with its underlying TeX distribution is fairly
simple. Five basic actions are connected to menu items, toolbar buttons or 
command shortcuts to 

— typeset (e.g., running latex once, or twice in advanced mode) 

— make the bibliography (e.g. running bibtex) 

— make the index (e.g., running makeindex) 

— render graphics (e.g., running dvipdf) 

All these actions are connected to shell scripts stored on a per document basis. 
If necessary, the user can customize them or even change the whole process by 
inserting an in-line instruction at the very beginning of a source file. For example, 
the following directive, if present, will run pdflatex in shell-escape mode.

%!iTeXMac(typeset): pdflatex --shell-escape $iTMInput

The makeindex and bibtex command options can be set from panels, and 
other commands are supported. Moreover, the various log files are parsed, warn- 
ings and errors are highlighted with different colors and HTML links point to 
lines where an error occurred. Some navigation facilities from log file to output
are also provided: a string like [a number...] points to the output page.

As for documents, iTeXMac manages a list of default settings that fit a wide
range of situations, including for example

— LaTeX documents with DVI, PS or PDF engines

— books

— METAPOST documents

— ConTeXt documents

— HTML documents

— B. Gaulle’s French Pro documents.

Users can extend this list with a built-in editor, adding support for MusicTeX,
maybe gcc, and so on. 



5 Installing TeX and iTeXMac

The starting point for detailed documentation is the Mac OS X TeX/LaTeX
Web Site³ where one will find an overview of the TeX related tools available on
Mac OS X. As a graphical front-end, iTeXMac needs a TeX distribution to be
fully functional. Gerben Wierda maintains on his site⁴ the TeX Live⁵ distribution
and a set of useful packages. Other teTeX 2.0.2 ports are available from fink⁶
(and through one of its graphical user interfaces, such as finkcommander⁷) and
from Darwin Ports⁸, through a CVS interface.

The official web site of iTeXMac is hosted by the Open Source software
development website SourceForge at http://itexmac.sourceforge.net/. One
can find in the download section the disk images for the following products:

— iTeXMac, both stable and developer release

— an external editor for iTeXMac key binding

— the Hypertext Help With LaTeX wrapped as a searchable Mac OS X help file

— the TeX Catalog On line wrapped as a searchable Mac OS X help file

— the French LaTeX FAQ wrapped as a searchable Mac OS X help file.

An updater allows you to check easily for new versions. To install iTeXMac, just
download the latest disk image archive, double click and follow the instructions 
in the read-me file. 

Due to its Unix core, Mac OS X is no longer focused on only one user. To 
support multiple users, iTeXMac configuration files can be placed in different
locations to change defaults for all, or just certain users. The search path is: 

— the built-in domain as shipped with the application (with default, read-only 
settings) 

— the network domain (/Network/Library/Application Support/iTeXMac),
where an administrator can put material to override or augment the default 
behaviour of all the machines on a network 

³ http://www.esm.psu.edu/mac-tex/
⁴ http://www.rna.nl/tex.html
⁵ http://www.tug.org/texlive/
⁶ http://fink.sourceforge.net/pdb/package.php/tetex
⁷ http://finkcommander.sourceforge.net/
⁸ http://darwinports.opendarwin.org/ports/?by=cat&substr=print

— the local domain (/Library/Application Support/iTeXMac), where an
administrator can put material to override or augment the network or default
behaviour

— the user domain (~/Library/Application Support/iTeXMac), where
the user can put material to override or augment the local or default behaviour

This is a way to apply modifications to iTeXMac as a whole.
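
The override order of the four domains can be summarized as a merge in which later domains win. The Python sketch below only models that precedence: iTeXMac stores its configuration as files in the directories named above, whereas the dictionaries, the keys, and the function `effective_settings` are hypothetical stand-ins.

```python
def effective_settings(builtin, network=None, local=None, user=None):
    """Merge settings so that each later domain overrides the earlier ones."""
    merged = {}
    for domain in (builtin, network, local, user):
        if domain:
            merged.update(domain)
    return merged


# Hypothetical keys, used only to show the precedence.
settings = effective_settings(
    {"engine": "latex", "viewer": "built-in"},
    network={"engine": "pdflatex"},
    user={"viewer": "external"},
)
```

Here the network administrator's engine choice replaces the built-in default, while the user's viewer preference overrides everything below it.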

6 Inter-application Communication 

This section describes how iTeXMac communicates with other components, in
the hope that this syntax will also be used by other applications when relevant,
to avoid the current situation where there are as many AppleScript syntaxes as
there are available applications for TeX on the Macintosh. It also shows why
iTeXMac integrates so well with other editors or viewers.

6.1 Shell Commands 

iTeXMac acts as a server, such that other applications can send it messages. Each
time it starts, iTeXMac installs an alias to its own binary code in
~/Library/TeX/bin/iTeXMac. With the following syntax⁹, either from the command line, an
AppleScript or shell script, one can edit a text file at the location corresponding
to the given line and column numbers:

~/Library/TeX/bin/iTeXMac edit -file "filename"
-line lineNumber -column colNumber

The following syntax is used to display a PDF file at the location corresponding
to the given line, column and source file name:

~/Library/TeX/bin/iTeXMac display -file "filename.pdf"
-source "sourcename.tex" -line lineNumber -column colNumber
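
For an editor or viewer written in another language, the shell syntax above translates into an argument list. The Python sketch below merely assembles that list from the options shown in the text; `itexmac_command` is a hypothetical helper, and the order of the options is copied from the examples rather than taken from any further documentation.

```python
import os


def itexmac_command(action, filename, line, column, source=None,
                    binary="~/Library/TeX/bin/iTeXMac"):
    """Return the argument list for an iTeXMac 'edit' or 'display' call."""
    if action not in ("edit", "display"):
        raise ValueError("action must be 'edit' or 'display'")
    argv = [os.path.expanduser(binary), action, "-file", filename]
    if action == "display":
        if source is None:
            raise ValueError("display needs the source file name")
        argv += ["-source", source]
    argv += ["-line", str(line), "-column", str(column)]
    return argv
```

On a machine where iTeXMac is installed, the returned list could be handed to `subprocess.run`; passing a list instead of a single string avoids quoting problems with file names that contain spaces.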

6.2 AppleScript 

The same feature is implemented using this scripting language. It would be great
for the user if other TeX front ends on Mac OS X implemented the same
syntax.



tell application "iTeXMac" to edit "filename.tex"
at line lineNumber column colNumber

tell application "iTeXMac" to display "filename.pdf"
at line lineNumber column colNumber
in source "Posix source name.tex"

⁹ These commands should be entered all on one line. They are broken here due to the
relatively narrow columns.

iTeXMac support for AppleScript likewise covers the compile, bibliography and
index actions. They are not given here since there is no Apple Events Suite
dedicated to TeX. However, configuration files and instructions are given to let
third-party applications like AlphaX, BBEdit or emacs control iTeXMac using
those scripts.

6.3 HTML 

iTeXMac implements support for a URL scheme named file-special for editing,
updating or displaying files, for example

file-special://localhost/"filename.tex";
action=edit;line=lineNumber;column=columnNumber

file-special://localhost/"filename.pdf";action=display;
line=lineNum;column=columnNum;source="Posix source name.tex"

will ask iTeXMac to edit a source file or display the given file (assumed
to be PDF) and, when synchronization information is available, scroll to the
location corresponding to the given line and column in source (assumed to be
TeX). This allows adding dynamic links in HTML pages, in a TeX tutorial for
example.

7 The pdfsync Feature 

7.1 About Synchronization 

As the TeX typesetting system heavily relies on a page description language,
there is no straightforward correspondence between a part of the output and the
original description code in the input. A workaround was introduced a long time
ago by the commercial front ends Visual TeX¹⁰ and Textures¹¹ with a very efficient
implementation. Then LaTeX users could access the same features - though
in a less efficient implementation - through the use of srcltx.sty, which added
source specials in the DVI files. The command line option -src-specials now
gives this task to the TeX typesetting engine.

When used with an external DVI viewer or an external text editor, through
an X11 server or not, iTeXMac fully supports this kind of synchronization
feature.

For the PDF file format, Piero d’Ancona and the author elaborated a strategy
that works rather well for Plain TeX, ConTeXt and LaTeX users. While typesetting
a foo.tex file with LaTeX for example, the pdfsync package writes extra
geometry information in an auxiliary file named foo.pdfsync, subsequently used
by the front ends to link line numbers in source documents with locations in pages
of output PDF documents. iTeXMac and TeXShop¹² both support pdfsync.

¹⁰ http://www.micropress-inc.com/
¹¹ http://www.bluesky.com/
¹² http://www.uoregon.edu/~koch/texshop

The official pdfsync web site is:

http://itexmac.sourceforge.net/pdfsync.html

It was more convenient to use an auxiliary file than to embed the geometric 
information in the PDF output using pdftex primitives. The output file is not 
polluted with extraneous information and the front ends need not parse the PDF 
output to retrieve such metrics. 

7.2 The pdfsync Mechanism

A macro is defined to put pdfsync anchors at specific locations (for hboxes,
paragraphs and maths). There are essentially three problems we must solve:
the position of an object in the PDF page is not known until the whole page
is composed, the objects don’t appear linearly in the output¹³, and finally, an
input file can be entirely parsed long before its contents are shipped out. To
solve these, at each pdfsync anchor the known information (line number and
source file name) is immediately written to the pdfsync auxiliary file and the
unknown information (location and page number) is written only at the
next ship out.

7.3 The .pdfsync File Specifications

This is an ASCII text file organized into lines. There is no required end of line 
marker format from among the standard ones used by operating systems. 

Only the first two lines, described in table 1, are required; the other ones are
optional. The remaining lines are described according to their starting characters;
they consist of 2 interlaced streams. A synchronous one, detailed in table 2,
is obtained with \immediate\write commands and concerns the input information. An
asynchronous one, detailed in table 3, is obtained with delayed \write commands and
concerns the output information.



Table 1. pdfsync required lines format.

Line  Format     Description                Comment
1st   jobName    jobName: case-sensitive    In general, the extensionless name of the
                 TeX file name              file as the result of an
                                            \immediate\write\file{\jobname}
2nd   version V  V: a 0-based nonnegative   The current version is 0
                 integer



The correspondence between the two kinds of information is made through a 
record counter, which establishes a many-to-many mapping from line numbers 
in TeX sources to positions in PDF output.

¹³ The footnote objects provide a good example.

Table 2. pdfsync line specifications of the synchronous stream.

Line  Format   Description          Comment
“b”   b name   name: TeX file name  TeX is about to begin parsing name; all
                                    subsequent line and column numbers will
                                    refer to name. The path is relative to
                                    the directory containing the .pdfsync file.
                                    Path separators are the Unix “/”. The file
                                    extension is not required; “tex” is the
                                    default if necessary. Case sensitive.
“e”   e                             The end of the input file has been reached.
                                    Subsequent line and column numbers now
                                    refer to the calling file. Optional, but must
                                    match a corresponding “b” line.
“l”   l R L    R: record number,
      l R L C  L: line number,
               C: optional column
               number.



Table 3. pdfsync line specification of the asynchronous stream.

Line       | Format                | Description                                                            | Comment
“s”        | s S                   | S: physical page number                                                | TeX is going to ship out a new page.
“p”, “p*”  | p R x y  or  p* R x y | R: record number, x: horizontal coordinate, y: vertical coordinate.   | Both coordinates are respectively given by \the\pdflastxpos and \the\pdflastypos
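
The three tables can be read together as a small parsing problem. The following Python sketch is only an illustration, not part of pdfsync or iTeXMac: the function name parse_pdfsync and the shape of the returned data are assumptions, and the line tag of the third row of Table 2 is taken to be the letter “l”. It joins the synchronous and asynchronous streams through the record number.

```python
from collections import defaultdict


def parse_pdfsync(text):
    """Parse the content of a .pdfsync file (hypothetical helper, version 0)."""
    lines = text.splitlines()           # copes with \n, \r\n and \r endings
    job_name = lines[0].strip()         # 1st line: jobName
    version = int(lines[1].split()[1])  # 2nd line: "version V"

    records = defaultdict(lambda: {"file": None, "line": None,
                                   "column": None, "positions": []})
    file_stack = [job_name]  # innermost file TeX is currently parsing
    page = None              # page announced by the latest "s" line

    for raw in lines[2:]:
        fields = raw.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "b" and len(fields) > 1:             # b name
            file_stack.append(raw.split(None, 1)[1].strip())
        elif tag == "e":                               # e
            if len(file_stack) > 1:
                file_stack.pop()
        elif tag == "l" and len(fields) >= 3:          # l R L  or  l R L C
            record = records[int(fields[1])]
            record["file"] = file_stack[-1]
            record["line"] = int(fields[2])
            if len(fields) > 3:
                record["column"] = int(fields[3])
        elif tag == "s" and len(fields) > 1:           # s S
            page = int(fields[1])
        elif tag in ("p", "p*") and len(fields) >= 4:  # p R x y
            record = records[int(fields[1])]
            record["positions"].append(
                (page, int(fields[2]), int(fields[3])))
    return job_name, version, dict(records)
```

Each record then carries both its source location and the list of (page, x, y) positions at which it was shipped out, which is the many-to-many mapping a front-end needs in order to jump between source and output.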



7.4 Known Problems 

Unfortunately, the various pdfsync files for Plain, LaTeX or ConTeXt are not
completely safe. Some compatibility problems with existing macro packages may
occur. Moreover, pdfsync sometimes actually influences the final layout; in
such a case, it should only be used in the document preparation stage.

Another mechanism widely used by ConTeXt sometimes makes pdfsync
ineffective: the macro expansion occurs only long after the text has been
parsed, so that \inputlineno is no longer relevant and the significant line
number is no longer accessible. This is a second argument for implementing
the pdfsync feature at a very low level, most certainly inside the pdftex engine
itself.



8 TWS: A TeX Wrapper Structure

In general, working with TeX seems difficult due to the numerous auxiliary files
created. Moreover, sharing TeX documents is often delicate as soon as we do not
use very standard LaTeX. The purpose of this section is to lay the foundation for
the TeX Wrapper Structure, which aims to help the user solve these problems.

Table 4. Contents of the TeX Project directory document.texp.

Name                 | Contents
Info.plist           | XML property list for any general purpose information, wrapped in an info dictionary described in Table 5. Optional.
spellingKey.spelling | XML property list for lists of known words, wrapped in a spelling dictionary defined in Table 9 and uniquely identified by spellingKey. This format is stronger than a simple comma separated list of words. Optional.
frontends            | Directory dedicated to front-ends only.
frontends/name       | Private directory dedicated to the front-end identified by name. The definition of its further contents is left to the front-end.
users                | Directory dedicated to users. Should not contain any front-end specific data.
users/name           | Directory dedicated to the user identified by name (not its login name). Not yet defined, but private and preferably encrypted.



First, it is very natural to gather all the files related to one TeX document in
one folder we call a TeX Wrapper. The file extension for this directory is texd,
in reference to the rtf and rtfd file extensions already existing on Mac OS X.
The contents of a TeX wrapper named document.texd are divided according to
different criteria:

— required data (source, graphics, bibliography database) 

— helpful data and hints (tex, bibtex, makeindex options, known words) 

— user-specific data 

— front-end-specific data 

— cached data 

— temporary data 

It seems convenient to gather all the non-required information in one folder
named document.texd/document.texp such that silently removing this directory
would cause no harm. As a consequence, no required data should stay inside
document.texp, and this is the only rule concerning the required data. The texp
file extension stands for “TeX Project”.

In Tables 4 to 9 we show the core file structure of the document.texp
directory. This is a minimal definition involving only string encoding and spelling
information because there is no consensus yet among users and all the developers
of TeX solutions, on Mac OS X at least. We make use of the XML property list
data format storage as defined by

http://www.apple.com/DTDs/PropertyList-1.0.dtd

Table 5. info dictionary description.

Key        | Class      | Contents
isa        | String     | Required with value: info
version    | Number     | Not yet used but reserved
files      | Dictionary | The paths of the files involved in the project, wrapped in a files dictionary. Optional.
properties | Dictionary | Attributes of the above files, wrapped in a properties dictionary. Optional.
main       | String     | The fileKey of the main file, if relevant, where fileKey is one of the keys of the files dictionary. Optional.


Table 6. files dictionary description.

Key     | Class  | Contents
fileKey | String | The path of the file identified by the string fileKey, relative to the directory containing the TeX project. No two different keys should correspond to the same path.



Table 7. properties dictionary description.

Key     | Class      | Contents
fileKey | Dictionary | Language, encoding, spelling information and other attributes, wrapped in an attributes dictionary described in Table 8. fileKey is one of the keys of the files dictionary.



Table 8. attributes dictionary description.

Key      | Class  | Contents
isa      | String | Required with value: attributes
version  | Number | Not yet used but reserved
language | String | According to latest ISO 639. Optional.
codeset  | String | According to ISO 3166 and the IANA Assigned Character Set Names. If absent, the standard C++ locale library module is used to retrieve the codeset from the language. Optional.
eol      | String | When non void and consistent, the string used as end of line marker. Optional.
spelling | String | One of the spellingKeys of Table 4, meaning that spellingKey.spelling contains the list of known words of the present file. Optional.

Table 9. spelling dictionary description.

Key     | Class  | Contents
isa     | String | Required with value: spelling
version | Number | Not yet used but reserved
words   | Array  | The array of known words
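
To make the dictionaries of Tables 5 to 8 concrete, here is a minimal Python sketch that reads an Info.plist with the standard plistlib module and looks up the main file with its attributes. The sample content is invented for illustration: the fileKey “master”, the spellingKey “french-words” and the function name describe_main_file are assumptions, not values defined by TWS or iTeXMac.

```python
import plistlib

# A hypothetical Info.plist following Tables 5 to 8.
INFO_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>isa</key><string>info</string>
  <key>version</key><integer>0</integer>
  <key>files</key>
  <dict>
    <key>master</key><string>document.tex</string>
  </dict>
  <key>properties</key>
  <dict>
    <key>master</key>
    <dict>
      <key>isa</key><string>attributes</string>
      <key>language</key><string>fr</string>
      <key>codeset</key><string>ISO-8859-1</string>
      <key>spelling</key><string>french-words</string>
    </dict>
  </dict>
  <key>main</key><string>master</string>
</dict>
</plist>
"""


def describe_main_file(plist_bytes):
    """Return (path, language, codeset, spellingKey) of the main file."""
    info = plistlib.loads(plist_bytes)
    if info.get("isa") != "info":
        raise ValueError("not an info dictionary")
    key = info.get("main")          # optional fileKey of the main file
    if key is None:
        return None
    path = info.get("files", {}).get(key)
    attributes = info.get("properties", {}).get(key, {})
    return (path, attributes.get("language"),
            attributes.get("codeset"), attributes.get("spelling"))
```

Since every entry except isa is optional, the lookups use default values rather than assuming that a key is present.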



However, this mechanism doesn’t actually provide the concrete information
needed to typeset properly (engine, format, output format). For that we can use
Makefiles or shell scripts, either embedded in the TeX Wrapper itself or shipped
as a standard tool in a TeX distribution. This latter choice is less powerful but
much more secure. Anyway, a set of default actions to be performed on a TeX
Wrapper should be outlined (compose, view, clean, archive...).

Technically, iTeXMac uses a set of private, built-in shell scripts to typeset
documents. If these are not suitable, customized ones are used instead, but no
warning is then given. No security problem has been reported yet, most certainly
because such documents are not shared.

Notice that iTeXMac declares texd as a document wrapper extension to Mac
OS X, which means that document.texd folders are seen by other applications
just like other single file documents; their contents are hidden at first glance.
Using another file extension will prevent this Mac OS X feature without losing
the benefit of the TeX Wrapper Structure.

A final remark concerns the version control system in standard use among
TeX users. In the current definition, only one directory level should be supported
in a document.texp folder. The contents of the frontends and users directories
should not be monitored.



9 Nota Bene 



Some features discussed here are still in the development stage and are still being
tested and validated (for example, advanced syntax highlighting and full TWS
support).

Abstract. This article sums up our experience with MlBibTeX, our
multilingual implementation of BibTeX, and points out some possible
improvements for better co-operation between LaTeX and MlBibTeX. Also,
MlBibTeX may be used to generate bibliographies written according to
other formalisms, especially formalisms related to XML, and we give some
ways to ease that.

Keywords: Bibliographies, multilingual features, BibTeX, MlBibTeX,
bst, nbst, XML, XSLT, XSL-FO, DocBook.



1 Introduction 

MlBibTeX (for ‘MultiLingual BibTeX’) is a reimplementation of BibTeX [21], the
bibliography processor associated with LaTeX [19]. The project began in October
2000, and has resulted in two experimental versions [9, 11] and the present
version (1.3), which will be available publicly by the time this article appears.
As we explained in [15], a prototype using the Scheme programming language
is working whilst we are developing a more robust program written in C. The
prototype has allowed us to get some experience with real-sized bibliographies:
this is the purpose of the first part of this article, after a short review of the
modus operandi of MlBibTeX.

MlBibTeX’s present version no longer uses the bst language of BibTeX for
bibliography styles [20]. Such .bst files were used in MlBibTeX’s first version, but
since this old-fashioned language, based on simple stack manipulations, is not
modular, we quickly realised that this choice would have led us to styles that
were too complicated [12]. Thus, Version 1.3 uses the nbst (for ‘New Bibliography
STyles’) language, described in [13] and similar to XSLT¹, the language of
transformations designed for XML texts [32]. More precisely, MlBibTeX 1.3 uses XML²
as a central formalism in the sense that parsing files containing bibliographical

¹ Extensible Stylesheet Language Transformations.

² Extensible Markup Language. A good introduction to this formalism issued by the
W3C (World Wide Web Consortium) is [24].



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 203–215, 2004.
© Springer-Verlag Berlin Heidelberg 2004

@BOOK{silke1988,
  AUTHOR = {James~R. Silke},
  TITLE = {Prisoner of the Horned Helmet},
  PUBLISHER = {Grafton Books},
  YEAR = 1988,
  NUMBER = 1,
  SERIES = {Frank Frazetta’s Death Dealer},
  NOTE = {[Pas de traduction fran\c{c}aise
           connue] ! french
          [Keine deutsche Übersetzung]
           ! german},
  LANGUAGE = english}

Fig. 1. Example of a bibliographical entry in MlBibTeX.



entries (.bib files) results in a DOM³ tree. Bibliography styles written using nbst
are XML texts, too.

Of course, nbst can be used to generate bibliographies for documents other
than those processed with LaTeX⁴. In particular, nbst eases the generation of
bibliographies for documents written using XML-like syntax. Nevertheless, dealing
with .bib files raises some problems: we go into them thoroughly in Section 4.

Reading this article requires only a basic knowledge of LaTeX, BibTeX and
XML. Some examples given in the next section will use the commands provided by
the multilingual babel package of LaTeX 2ε [2]. Other examples given in Section 4
will use the Scheme programming language, but if need be, referring to an
introductory book such as [28] is sufficient to understand them.

2 Architecture of MlBibTeX

2.1 How MlBibTeX Works

As a simple example of using MlBibTeX with LaTeX, let us consider the silke1988
bibliographical entry given in Figure 1. As we explain in [15], the sequence
‘[...] ! ⟨idf⟩’ is one of the multilingual features provided by MlBibTeX, defining a
string to be included when the language of a corresponding reference, appearing
within a bibliography, is idf. So if this entry is cited throughout a document
written in French and the ‘References’ section is also written in French, it will
appear as:

[1] James R. Silke: Prisoner of the Horned Helmet. N° 1 in Frank
Frazetta’s Death Dealer. Grafton Books, 1988. Pas de traduction
française connue.
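
The selection performed on the NOTE field can be sketched in a few lines of Python. This is only an illustration of the ‘[...] ! ⟨idf⟩’ convention as it appears in Figure 1, not MlBibTeX’s actual parser (which is written in Scheme and C); the function name select_for_language is an assumption, and nested brackets or default strings are deliberately not handled.

```python
import re

# One "[text] ! language" group; whitespace (including newlines) may
# separate the closing bracket, the "!" and the language identifier.
GROUP = re.compile(r"\[(?P<text>[^\]]*)\]\s*!\s*(?P<lang>[A-Za-z]+)")


def select_for_language(value, language):
    """Return the string of a field value for one language, or None."""
    groups = {}
    for match in GROUP.finditer(value):
        # Normalise internal whitespace, as line breaks are not significant.
        groups[match.group("lang")] = " ".join(match.group("text").split())
    return groups.get(language)
```

Applied to the NOTE value of Figure 1, asking for french yields the French sentence, while asking for english yields nothing, which is why the English reference above carries no note of its own.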

³ Document Object Model. This is a W3C recommendation for a standard tree-based
programming approach [24, p. 306-308], very often used to implement XML trees.

⁴ This is also the case with the bst language of BibTeX, but in practice, it seems
that this feature has not been used, except for documents written in SCRIBE [25], a
predecessor of LaTeX.

Here and in the bibliography of this article, we use a ‘plain’ style, that is,
references are labelled with numbers. More precisely, the source processed by
LaTeX, included in the .bbl file generated by MlBibTeX, is:

\begin{thebibliography}{...}
  \bibitem{silke1988}
  \begin{otherlanguage*}{english}
    James~R. \textsc{Silke}: \emph{Prisoner of the
    Horned Helmet}.
    \foreignlanguage{french}{\bblno~1 \bblof}
    \emph{Frank Frazetta’s Death Dealer}. Grafton
    Books, 1988. \foreignlanguage{french}{Pas de
    traduction fran\c{c}aise connue}.
  \end{otherlanguage*}
\end{thebibliography}

Let us examine this source text. We can notice the use of additional LaTeX
commands to put some keywords (‘\bblin’ for ‘in’, ‘\bblno’ for ‘N°’, that is,
‘number’ in French). In [14], we explain how to put them into action within
LaTeX and how MlBibTeX uses them. This source also shows how English words,
originating from an entry in English (see the value of the LANGUAGE field in
Figure 1), are processed. If the document uses the babel package, and if the french
option of this package is selected, we use the \foreignlanguage command of
this package [2], as shown above. Users do not have to select its english option;
if it is not active, the source text generated by MlBibTeX looks like:

\bibitem{silke1988}James~R. \textsc{Silke}:
\emph{Prisoner of the Horned Helmet}, \bblno~1
\bblof\ \emph{Frank Frazetta’s Death Dealer}.
Grafton Books, 1988. Pas de traduction
fran\c{c}aise connue.

but the English words belonging to this reference will be taken as French by
LaTeX and thus may be processed or hyphenated incorrectly.

2.2 The Modules of MlBibTeX

As mentioned in the introduction, parsing a .bib file results in a DOM tree. In
fact, .bib files are processed as if they were XML trees, but without whitespace
nodes⁵. Following this approach, the entry silke1988 given in Figure 1 is viewed

⁵ These are text nodes whose contents are only whitespace characters, originating from
what has been typed between two tags [27, p. 25-26]. For example, if the XML text of
Figure 2 is parsed, there is a whitespace node, containing a newline and four space
characters, between the opening tags <author> and <name>. XML parsers are bound
by the ‘all text counts’ constraint included in the XML specification [33, § 2.10], and
cannot ignore such whitespace characters.




Jean-Michel Hufflen

<book id="silke1988" language="english">
    <author>
        <name>
            <personname>
                <first>James R.</first>
                <last>Silke</last>
            </personname>
        </name>
    </author>
    <title>Prisoner of the Horned Helmet</title>
    <publisher>Grafton Books</publisher>
    <year>1988</year>
    <number>1</number>
    <series>
        Frank Frazetta’s Death Dealer
    </series>
    <note>
        <group language="french">
            Pas de traduction française connue
        </group>
        <group language="german">
            Keine deutsche Übersetzung
        </group>
    </note>
</book>



Fig. 2. The XML tree corresponding to the entry of Figure 1.

as the tree of Figure 2, except that the whitespace nodes that an XML parser
would produce are excluded.

We can see that some LaTeX commands and special characters are converted
according to the conventions of XML.

— The commands used for accents and special letters are replaced by the letter
itself. This poses no problem since DOM trees are encoded in Unicode [29]. As
an example, the ‘\c{c}’ sequence in the value of the NOTE field in Figure 1 is
replaced by ‘ç’ in Figure 2. (By the way, let us remark that MlBibTeX can
handle the 8-bit latin1 encoding⁶: notice the ‘Ü’ character inside this value.)

— Likewise, the commands:

  • ‘\␣’ for a simple space character,

  • ‘\\’ for an end-of-line character,

and the sequences of characters:

  • ‘~’ for an unbreakable space character,

  • ‘--’ and ‘---’ for dash characters,

are replaced by the corresponding Unicode values for these characters⁷:

⁶ See [7, Table C.4] for more details.

⁷ That was not the case in earlier versions; for instance, [12, Figure 3] improperly
includes a tilde character in a text node. This bug was fixed at the end of 2003.

<nbst:template match="group">
  <nbst:if test="@language=$language">
    <!-- The $language variable is set to the current language. -->
    <nbst:value-of select="call(language_open_change,$language)"/>
    <!-- If the babel package is used and a known option has been selected,
         this external function writes the \foreignlanguage command... -->
    <nbst:apply-templates use-language="@language"/>
    <nbst:value-of select="call(language_close_change,$language)"/>
    <!-- ... and this external function puts a closing brace. -->
  </nbst:if>
</nbst:template>



Fig. 3. Example of calling an external function.

    &#x20;    &#xA;    &#xA0;    &#x2013;    &#x2014;

An example is given by the value of the AUTHOR field, see Figures 1 & 2.

— Some characters escaped in LaTeX (for example, ‘%’, ‘&’) lose the escape
character:

    \%  =>  %

The escape is restored if MlBibTeX generates a .bbl file to be processed by
LaTeX. Other characters are replaced by a reference to a character entity⁸:

    \&  =>  &amp;      <  =>  &lt;      >  =>  &gt;

— Balanced delimiters for quotations (‘`’ and ‘'’, or ‘``’ and ‘''’) are
replaced by an emph element⁹:

    `Tooth and Claw'  =>
    <emph emf='no' quotedf='yes'>
      Tooth and Claw
    </emph>

If ‘'’ or ‘"’ characters are unbalanced, they are replaced by references to
character entities used in XML documents:

    '  =>  &apos;      "  =>  &quot;

Such an XML tree, resulting from our parser, may be validated using a DTD¹⁰;
more precisely, by a revised version of the DTD sketched in [10].
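
Several of the conversions listed above amount to a table-driven substitution. The Python sketch below illustrates the idea on a handful of sequences only; it is not MlBibTeX’s parser, the name latex_text_to_xml is an assumption, and quotation handling and the general treatment of accent commands are left out. Longer sequences are tried first, so that ‘---’ is not read as ‘--’ followed by ‘-’.

```python
import re

# A few of the conversions described above (illustrative subset).
REPLACEMENTS = {
    r"\c{c}": "\u00e7",   # accent command -> the letter itself
    r"\%": "%",           # escaped character loses its escape
    r"\&": "&amp;",       # replaced by a character entity reference
    "<": "&lt;",
    ">": "&gt;",
    "---": "\u2014",      # long dash
    "--": "\u2013",       # short dash
    "~": "\u00a0",        # unbreakable space
}

# Alternation ordered by decreasing length: the longest sequence wins.
TOKEN = re.compile("|".join(
    re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)))


def latex_text_to_xml(text):
    """Convert a fragment of LaTeX text into XML character data."""
    return TOKEN.sub(lambda match: REPLACEMENTS[match.group(0)], text)
```

A single pass over the text avoids converting the output of one rule a second time; for instance, the ‘&’ of a generated ‘&amp;’ is never looked at again.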

Some examples of using nbst for bibliography styles are given in [12-14]. We
give another example in Figure 3. We can see that this language is close to XSLT

⁸ See [24, p. 48] for more details.

⁹ ‘emph’ is of course for ‘emphasise’: all the attributes (for example, ‘quotedf’ for
‘quoted-flag’, used for specifying a quotation) default to no, except emf, which
defaults to yes. The complete specification is given in [10].

¹⁰ Document Type Definition. A DTD defines a document markup model [24, Ch. 5].

and it uses path expressions as in the XPath language [31]. Also, the example
shows how multilingual features (for example, the sequence ‘[...] ! ...’) are
processed: we use some external functions in order to determine which LaTeX
command can be used to switch to another language. These external functions
are written using the language of MlBibTeX’s implementation: Scheme for the
prototype, C for the final program.



3 with 

When BibTeX generates a .bbl file, it does not use the source file processed by
LaTeX, but only the auxiliary (.aux) file, in which the definition of all the labels
provided by the commands \label and \bibitem is stored. This file also contains
the name of the bibliography style to be used and the paths of bibliography data
bases to be searched, so BibTeX need not look at any other file.
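
As an illustration of how little BibTeX needs, the following Python sketch extracts the relevant information from the text of an .aux file, relying on the \bibstyle, \bibdata and \citation commands that LaTeX writes there. The function name read_aux and the returned dictionary are assumptions made for this example; they do not describe the implementation of BibTeX or MlBibTeX.

```python
import re


def read_aux(aux_text):
    """Collect bibliography style, data bases and cited keys from .aux text."""
    def arguments(command):
        found = []
        pattern = r"\\" + command + r"\{([^}]*)\}"
        for group in re.findall(pattern, aux_text):
            # An argument may hold several comma-separated names.
            found.extend(part.strip() for part in group.split(",")
                         if part.strip())
        return found

    styles = arguments("bibstyle")
    # Keep each cited key once, in order of first citation.
    citations = list(dict.fromkeys(arguments("citation")))
    return {
        "style": styles[-1] if styles else None,
        "databases": arguments("bibdata"),
        "citations": citations,
    }
```

Nothing in this information says which multilingual packages the document loads, which is the gap discussed next.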

This is not true for MlBibTeX. It still uses the .aux file as far as possible, but it
also has to determine which multilingual packages are used: first of all babel, but
also some packages devoted to particular languages: french [6], german [23], ...
So we have to do a partial parsing of the .tex file for that. For better co-operation
between LaTeX and MlBibTeX, this could be improved, in that information about
multilingual packages used, and languages available, could be put in the .aux file.
In fact, the external functions of our new bibliography styles are only used to
manage information extracted from a .tex file. Expressing such operations using
nbst would be tedious.

Another improvement regarding the natural languages known by LaTeX would
be a connection between:

a) the language codes used in XML, specified by means of a two-letter language
abbreviation, optionally followed by a two-letter country code [1] (for
example, ‘de’ for ‘deutsch’ (‘German’), ‘en-UK’, ‘en-US’, etc.); and

b) the resources usable to write texts in these languages.

For example, a default framework could be the use of the babel package, and
‘de’ would get access to the german option of this package, although it could be
redefined to use the ad hoc package name german. In the future, such a framework
would allow us to homogenise all the notations for natural languages to those of
XML. In addition, let us notice that ConTeXt¹¹ [8] already uses these two-letter
codes in its \selectlanguage command.
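
Such a connection could start as a simple lookup table with user overrides. The Python sketch below is purely hypothetical: the table contents, the function name babel_option and the fallback from a full code to its two-letter language part are assumptions chosen for illustration, not a framework proposed by LaTeX, babel or MlBibTeX.

```python
# Hypothetical default mapping from XML language codes to babel options.
DEFAULT_BABEL_OPTIONS = {
    "de": "german",
    "en": "english",
    "en-uk": "british",
    "en-us": "american",
    "fr": "french",
}


def babel_option(xml_lang, overrides=None):
    """Return the resource name for an XML language code, or None.

    Keys of `overrides` are expected in lower case.
    """
    table = dict(DEFAULT_BABEL_OPTIONS)
    table.update(overrides or {})    # a redefinition takes precedence
    code = xml_lang.strip().lower()
    if code in table:                # exact match, e.g. "en-US"
        return table[code]
    return table.get(code.split("-")[0])  # fall back to the language part
```

The overrides argument plays the role of the redefinition mentioned above, for instance mapping ‘de’ to some other resource than the default one.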

And last but not least, auxiliary files should include information about the
encoding used in the source text. As can be seen in the examples of Section 2.1,
accented letters are replaced by the commands used to produce them in LaTeX,
even though LaTeX can of course handle 8-bit encodings (provided that the
inputenc package is loaded with the right option). This is to avoid encoding

¹¹ TeX, defined by Donald E. Knuth [18], provides a general framework to format texts.
To be fit for use, the definitions of this framework need to be organised in a format.
Two such formats are plain TeX and LaTeX, and another is ConTeXt, created by
Hans Hagen.

<nbst:bst version="1.3" id="plain" xmlns:nbst=
    "http://lifc.univ-fcomte.fr/~hufflen/mlbibtex"
>

  <nbst:output method="LaTeX"/>

</nbst:bst>

Fig. 4. Root element for a bibliography style written using nbst. 



problems. In addition, such information would ease the processing of languages 
written using non-Latin alphabets. 



4 Towards the XML World 

Since a .bib file can be processed as an XML tree by a bibliography style written
in nbst, MlBibTeX opens a window on XML’s world. A converter from .bib files
to a file written using HTML¹², the language of Web pages, becomes easy to
write. So does a tool to write a bibliography as an XSL-FO¹³ document [34].
More precisely, we give in Figure 4 an example of using the root element of nbst.
Possible values for the method attribute of the nbst:output element are:

    LaTeX    xml    html    text

Nevertheless, this approach has an important limitation in practice. Since
BibTeX has traditionally been used to generate files suitable for LaTeX, users
often put LaTeX commands inside values of BibTeX fields¹⁴. For example:

    ORGANIZATION = {\textsc{tug}}

In such a case, we would have to write a mini-LaTeX program (or perhaps a new
output mode for LaTeX) that would transform such a value into a string suitable
for an XML parser.

The problem is more complicated when commands are defined by end-users.
For instance:

    ORGANIZATION = {\logo{tug}}

works with BibTeX - or MlBibTeX when we use it for generating LaTeX output
- even though \logo has an arbitrary definition; for example,

    \newcommand{\logo}[1]{\textsc{#1}}

¹² HyperText Markup Language.

¹³ Extensible Stylesheet Language - Formatting Objects: this language aims to
describe high-quality print outputs. Such documents can be processed by the shell
command xmltex (resp. the shell command pdfxmltex) from PassiveTeX [22, p. 180]
to get .dvi files (resp. .pdf files).

¹⁴ The author personally confesses to using many \foreignlanguage commands within
the values of BibTeX fields, before deciding to develop MlBibTeX.

<bibliography>
  <title>References</title>
  <biblioentry>
    <abbrev>silke1989</abbrev>
    <authorgroup>
      <author>
        <firstname>James R.</firstname>
        <surname>Silke</surname>
      </author>
    </authorgroup>
    <copyright><year>1989</year></copyright>
    <isbn>0-586-07018-4</isbn>
    <publisher>
      <publishername>
        Grafton Books
      </publishername>
    </publisher>
    <title>Lords of Destruction</title>
    <seriesinfo>
      <title>
        <othercredit>
          <firstname>Frank</firstname>
          <surname>Frazetta</surname>
        </othercredit>’s Death Dealer
      </title>
      <volumenum>2</volumenum>
    </seriesinfo>
  </biblioentry>
</bibliography>

Fig. 5. The bibliographical reference from Figure 1 expressed in DocBook. Note the 
ad hoc tag <othercredit>. 



according to LaTeX’s conventions, or:

    \def\logo#1{\textsc{#1}}

if a style close to plain TeX is used. Likewise, such commands can be known
when an output file from MlBibTeX is processed by ConTeXt.

Moreover, let us consider the bibliographical reference given in Figure 5, 
according to the conventions of DocBook, a system for writing structured doc- 
uments [36] (we use the conventions of the xml version of DocBook, described 
in [26]). We can see that some information is more precise than that provided 
in Figure 1. But there are still complexities: the person name given in the value 
of the SERIES field is surrounded by an ad hoc element in the DocBook version. 

If we want to take advantage of the expressive power of DocBook, we can:

— directly process an XML file for bibliographical entries. In this case, our DTD
should be extended; that is possible, but we still need a solution to process
the huge number of existing .bib files;

— introduce some new syntax inside .bib files, which might be complicated and
thus perhaps unused in practice;

— introduce new LaTeX commands, to be processed like the \logo example
mentioned above.

We have experimentally gone quite far in the third direction, which also
allows us to deal with the LaTeX commands already in .bib files. In Figure 6,
we give some examples of such processing, as implemented in the prototype¹⁵.

As can be seen, we have defined a new function in Scheme, named
define-pattern, with two formal arguments. The first is a string viewed as a pattern,
following the conventions of TeX for defining commands, that is, the arguments
of a command are denoted by ‘#1’, ‘#2’, ... (cf. [18, Ch. 20]). The second
argument may also be a string, in which case it specifies a replacement. The
arguments of the corresponding command are processed recursively. In case of
conflict among patterns, the longest is chosen. So, the pattern "\\logo{#1}"¹⁶
takes precedence over the pattern "{#1}".

If the second argument of the define-pattern function is not a string, it
must be a zero-argument function that results in a string. In this case, all the
operations must be specified explicitly, using the following functions we wrote:

pattern-matches? returns a true value if its first argument matches the second,
a false value otherwise;

pattern-process recursively processes its only argument, after replacing
sub-patterns by corresponding values¹⁷;

pattern-replace replaces the sub-patterns of its argument by corresponding
values; these values are not processed, just replaced verbatim.

Whether given directly as the second argument to define-pattern or resulting
from applying a zero-argument function, the string must be well-formed w.r.t.
XML’s conventions, that is, tags must be balanced, attributes must be
well-formed, etc. In other words, such a string must be acceptable to an XML parser:
in our case, the parser is SSAX¹⁸ [17].
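
The behaviour described for define-pattern with a string replacement (recursive processing of arguments, longest pattern first) can be mimicked in Python. The sketch below is a deliberately reduced analogue, not a translation of the Scheme prototype: it only supports patterns made of an optional command name followed by a single ‘{#1}’ argument, and the replacement chosen for \logo is a bare emph element because the attributes are not spelled out in the text.

```python
PATTERNS = {}  # text preceding "{#1}" in the pattern -> replacement string


def define_pattern(pattern, replacement):
    """Register a pattern of the restricted form <prefix>{#1}."""
    if not pattern.endswith("{#1}"):
        raise ValueError("this sketch only supports one trailing {#1}")
    PATTERNS[pattern[:-len("{#1}")]] = replacement


def _matching_brace(text, start):
    """Index of the brace closing the one opened at text[start]."""
    depth = 0
    for index in range(start, len(text)):
        if text[index] == "{":
            depth += 1
        elif text[index] == "}":
            depth -= 1
            if depth == 0:
                return index
    raise ValueError("unbalanced braces")


def pattern_process(text):
    """Replace registered patterns, processing arguments recursively."""
    prefixes = sorted(PATTERNS, key=len, reverse=True)  # longest first
    output, index = [], 0
    while index < len(text):
        for prefix in prefixes:
            if text.startswith(prefix + "{", index):
                opening = index + len(prefix)
                closing = _matching_brace(text, opening)
                argument = pattern_process(text[opening + 1:closing])
                output.append(PATTERNS[prefix].replace("#1", argument))
                index = closing + 1
                break
        else:  # no pattern starts here: copy the character
            output.append(text[index])
            index += 1
    return "".join(output)
```

With define_pattern("{#1}", "<asitis>#1</asitis>") and define_pattern("\\logo{#1}", "<emph>#1</emph>") registered, the value \logo{tug} is handled by the longer pattern and is not wrapped in an asitis element, while a bare brace group still is.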

The examples given in Figure 6 allow us to see that we can deal with simple
commands, like:

    \logo{...}  =>  <emph ...>...</emph>

as well as more complicated cases, like a cascade of \iflanguage commands [2]:



¹⁵ This feature has not yet been implemented in the final version.

¹⁶ Let us recall that in Scheme, the backslash character (‘\’) is used to escape special
characters in string constants. To include it within a string, it must itself be escaped.

¹⁷ In fact, using a string s as a second argument of define-pattern yields the
evaluation of the expression (lambda () (pattern-process s)).

¹⁸ Scheme implementation of SAX. ‘SAX’ is for ‘Simple API (Application Programming
Interface) for XML’: this name denotes a kind of parser, see [24, p. 290-292].




(define-pattern "{#1}"
  ;; The asitis element is used for words that should never be uncapitalised,
  ;; that is, proper names. In BibTeX, we specify such behaviour by
  ;; surrounding words by additional braces.
  "<asitis>#1</asitis>")

\iflanguage{...}{...}{%
  \iflanguage{...}{...}{...}}

which becomes:

<nonemptyinformation>
  <group language='...'>...</group>
  <group language='...'>...</group>
</nonemptyinformation>

The nonemptyinformation element is used for information that must be output,
possibly in a default language if no translation into the current language is
available.

What we do by means of our define-pattern function is like the additional
procedures in Perl¹⁹ that the converter LaTeX2HTML [4] can use to translate
additional commands.

5 Conclusion 

Managing several formalisms can be tedious. This fact was one of the main elements
in XML’s design: giving a central formalism, able to be used for representing
trees, and allowing many tools using different formalisms to communicate.

BibTeX deals with three formalisms: .aux files, .bib files and .bst files. As
Jonathan Fine notes in [5], the applications devoted to a particular formalism
cannot be shared with other applications. MlBibTeX attempts to use XML as far
as possible, although there is still much to do: for example, defining a syntax for
the entries we are looking for when using MlBibTeX to generate XSL-FO
or DocBook documents. (For our tests, this list of entry names is simply given
on the command line.)

The next step will probably be a more intensive use of XML, that is, the
direct writing of bibliographical entries using XML conventions. For this, we
need something more powerful than DTDs, with a richer type structure, namely,
schemas²⁰. In addition, we should be able to easily add new fields to
bibliographical entries: the example given using DocBook shows that additional information
must be able to be supplied to take advantage of the expressive power of this
system. But such additions are difficult to model with DTDs²¹. We are presently

¹⁹ Practical Extraction and Report Language.

²⁰ Schemas have more expressive power than DTDs, because they allow users to define
types precisely, which in turn makes for a better validation of an XML text. In
addition, this approach is more homogeneous since schemas are XML texts, whereas
DTDs are not.

There are currently four ways to specify schemas: Relax NG [3], Schematron [16],
Examplotron [30], XML Schema [35]. At present, it seems to us that XML Schema is
the most suitable for describing bibliographical entries.

²¹ Whereas that is easy with ‘old’ BibTeX, provided that you use a bibliography style
able to deal with additional fields.

going thoroughly into replacing our DTD by a schema; when this work reaches
maturity, bibliographical entries using XML syntax could be directly validated
using schemas.

Abstract. For many years the Polish TeX Users Group newsletter has
been published online on the GUST web site. The repository now
contains valuable information on METAFONT, electronic documents,
computer graphics and related subjects. However, access to the content
is very poor: it is available as PS/PDF files with only a simple HTML
page facilitating navigation. There is no integration with information
resources from other sections of the site, nor with the resources from
other LUG or CTAN sites.

Topic maps were initially developed for efficient preparation of indices, 
glossaries and thesauruses for electronic document repositories, and
are now codified as both the ISO standard (ISO/IEC 13250) and the
XTM 1.0 standard. Their applications extend to the domain of electronic 
publishing. Topic maps and the similar RDF standard are considered to 
be the backbone of corporate knowledge management systems and/or 
the Semantic Web [3]. 

The paper contains an introduction to the Topic Maps standard and 
discusses selected problems of Topic Map construction. Finally, the
application of Topic Maps as an interface to the repository of TeX-related
resources is presented, as well as the successes and challenges encountered 
in the implementation. 



1 Introduction 

All the papers published for the last 10 years in the bulletin of the Polish TeX
Users’ Group (GUST, http://www.gust.org.pl/) are now available on-line
from the GUST Web site. The repository contains valuable information on
METAFONT, electronic documents, computer graphics, typography and
related subjects. However, access to the content itself is very poor: the papers 
are available as PS/PDF files with only a simple HTML interface facilitating 
navigation. There is no integration with other resources from that site. As CTAN 
and other LUGs’ sites provide more resources it would obviously be valuable to 
integrate them too. 

At first glance, the Topic Maps framework appears to be an attractive way
to integrate vast amounts of dispersed TeX-related resources. A primary goal
of the proposed interface should be to support learning. If the project succeeds,



A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 216-228, 2004.
© Springer-Verlag Berlin Heidelberg 2004
we hope it will slightly change the perception of TeX as a very difficult subject
to become acquainted with.

The paper is organized as follows. The standard is introduced and selected 
problems of topic maps construction are discussed in the subsequent three sec- 
tions. Then a short comparison of Topic Maps and RDF is presented. The 
application of Topic Maps as an interface to the GUST resource repository is 
described in the last two sections. 

2 What Is a Topic Map? 

Topic Maps are an SGML/HyTime-based standard defined in [1] (ISO/IEC
13250, often referred to as HyTM). The standard was recently rewritten by
an independent consortium, TopicMaps.org [19], and renamed XML Topic
Maps (XTM). XTM was developed in order to simplify the ISO specification
and enable its usage for the Web through XML syntax. Also, the original
linking scheme was replaced by XLink/XPointer syntax. XTM was recently
incorporated as an Annex to [1].

The standard enumerates the following possible applications of TMs [1]:

— To qualify the content and/or data contained in information objects as 
topics, to enable navigational tools such as indexes, cross-references, citation 
systems, or glossaries. 

— To link topics together in such a way as to enable navigation between them. 

— To filter an information set to create views adapted to specific users or pur- 
poses. For example, such filtering can aid in the management of multilingual 
documents, management of access modes depending on security criteria, de- 
livery of partial views depending on user profiles and/or knowledge domains, 
etc. 

— To add structure to unstructured information objects, or to facilitate the 
creation of topic-oriented user interfaces that provide the effect of merging 
unstructured information bases with structured ones. 

In short, a topic map is a model of knowledge representation based on three key 
notions: topics which represent subjects, occurrences of topics which are links 
to related resources, and associations (relations) among topics. 

A topic represents, within an application context, any clearly identified and
unambiguous subject or concept from the real world: a person, an idea, an object,
etc.

A topic is an instance of a topic type. Topic types can be structured as
hierarchies organized by superclass-subclass relationships. The standard does 
not provide any predefined semantics to the classes. Finally, topic and topic 
type form a class-instance relationship. 

Topics have three kinds of characteristics: names (none, one, or more),
occurrences, and roles in associations. The links between topics and their
related information (web page, picture, etc.) are defined by occurrences. The
linked resources are usually located outside the map. XTM uses a simple link
mechanism as defined in XLink, similar to HTML hyperlinks.

Examples of application of Topic Maps to real world problems can be found in
[9, 21, 5, 18, 11, 12].

As with topics, occurrences can be typed; occurrence types are often referred
to as occurrence roles. Occurrence types are also defined as topics. Using XML
syntax, the definition of a topic is quite simple:

<topic id="t-przechlewska-wanda">
  <instanceOf><topicRef xlink:href="#person"/></instanceOf>
  <baseName>
    <baseNameString>Plata-Przechlewska, Wanda</baseNameString>
  </baseName>
</topic>

Topic associations define relationships between topics. As associations are 
independent of the resources (i.e., the data layer) they represent added-value 
information. This independence means that a concrete topic map can describe
more than one information pool, and vice versa. Each association can have 
an association type which is also a topic. There are no constraints on how 
many topics can be related by one association. Topics can play specific roles 
in associations, described with association role types - which are also topics. 
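
The three notions introduced so far (topics, occurrences, associations) can be pictured as a small data model. The Python sketch below is only an illustration and is not part of the XTM standard or of the GUST map: the class names, the field names, and the topic t-paper-1 are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Topic:
    """One topic per subject; the topic type is referenced by its topic id."""
    id: str
    type_id: Optional[str] = None
    names: List[str] = field(default_factory=list)
    occurrences: List[str] = field(default_factory=list)  # links to resources


@dataclass
class Association:
    """Relates any number of topics; every member plays a role."""
    type_id: str                    # the association type is itself a topic id
    members: List[Tuple[str, str]]  # (role type id, member topic id) pairs


def related_topics(topic_id: str, associations: List[Association]) -> List[str]:
    """Return the ids of topics that share an association with topic_id."""
    found: List[str] = []
    for assoc in associations:
        ids = [member for _, member in assoc.members]
        if topic_id not in ids:
            continue
        for member in ids:
            if member != topic_id and member not in found:
                found.append(member)
    return found


author = Topic("t-przechlewska-wanda", "person", ["Plata-Przechlewska, Wanda"])
paper = Topic("t-paper-1", "paper", ["A hypothetical paper"])  # made-up topic
wrote = Association("author-paper", [("author", author.id), ("paper", paper.id)])
print(related_topics(author.id, [wrote]))
```

Because association types and role types are plain topic ids here, they can be described by topics of their own, which mirrors the statement above that both are "also topics".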

The concepts described above are shown in Fig. 1. Topics are represented 
as small ovals or circles in the upper half of the picture while the large oval at 
the bottom indicates the data layer. Small objects of different shapes contained in
the data layer represent resources of different types. The lines between the data 
layer and topics represent occurrences, while thick dashed ones between topics 
depict associations. 

Besides the above three fundamental concepts, the standard provides a notion 
of scope. All characteristics of topics are valid within certain bounds, called a 
scope, and determined in terms of other topics. Typically, scopes are used to 
model multilingual documents, access rights, different views, and so on. 

Scopes can also be used to avoid name conflicts when a single name denotes 
more than one concept. An example of scope for the topic latex might be 
computer application or rubber industry depending on the subject of the topic. 
Only the topic characteristics can be scoped, not the topic itself. 

3 Subject Identity and Map Merging 

From the above short tour of TM concepts it should be clear that there is 
an exact one-to-one correspondence between subjects and topics. Thus, the 
identification of subjects is crucial to individual topic map applications and to 
interoperability between different topic maps. 

The simplest and most popular way of identifying subjects is via some system
of unique labels (usually URIs). A subject identifier is
simply a URI unambiguously identifying the subject. If the subject identifier 
points to a resource (not required) the resource is called a subject indicator. The
subject indicator should contain human-readable documentation describing the
non-addressable subject [22].

Fig. 1. Topic map and resource layer.

As there are no restrictions to prevent every map author from defining 
their own subject identifiers and resource indicators, there is a possibility that 
semantic or syntactic overlap will occur. To overcome this, published subject 
indicators (PSIs) are proposed [17]. PSIs are stable and reliable indicators 
published by an institution or organization that desires to promote a specific 
standard. Anyone can publish PSIs and there is no registration authority. The 
adoption of PSIs can therefore be an open and spontaneous process [17, 6].

Subject identity is of primary importance for topic map merging when there 
is a need to recognize which topics describe the same subject. 

Two topics and their characteristics can be merged (aggregated) if the topics 
share the same name in the same scope (name-based merging), or if they refer to
the same subject indicator (subject-based merging). Merging results in a single
topic that has the union of all characteristics of merged topics. Merged topics play 
the roles in all the associations that the individual topics played before [22, 15]. 
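
The two merging rules can be sketched in a few lines of Python. This is an illustrative sketch only, not the algorithm of any particular topic map engine: topics are modelled as dictionaries of sets, and the key names ("names", "indicators", "occurrences") as well as the example URIs are assumptions made for this example.

```python
def merge_topics(topics):
    """Merge topics that share a (name, scope) pair or a subject indicator.

    Each topic is a dict of sets. The result keeps the union of all
    characteristics of the topics that were merged.
    """
    merged = []
    for topic in topics:
        current = {key: set(values) for key, values in topic.items()}
        remaining = []
        for other in merged:
            same_subject = current["indicators"] & other["indicators"]
            same_name = current["names"] & other["names"]
            if same_subject or same_name:
                # absorb the earlier topic into the current one
                for key in current:
                    current[key] |= other[key]
            else:
                remaining.append(other)
        merged = remaining + [current]
    return merged


a = {"names": {("Knuth, Donald", "en")},
     "indicators": {"http://example.org/authors#d-knuth"},
     "occurrences": {"paper1.pdf"}}
b = {"names": {("Knuth, Donald E.", "en")},
     "indicators": {"http://example.org/authors#d-knuth"},
     "occurrences": {"paper2.pdf"}}
c = {"names": {("Lamport, Leslie", "en")},
     "indicators": set(),
     "occurrences": set()}

result = merge_topics([a, b, c])
print(len(result))
```

Topics a and b share a subject indicator and are merged, so both name variants and both occurrences end up on one topic, while c stays separate.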



For example, the XTM 1.0 specification contains a set of PSIs for core concepts, such
as class, instance, etc., as well as for the identification of countries and languages [19].
4 Constraining, Querying, and Navigating the Map

The notion of a topic map template is used frequently in the literature. As the name
suggests, a topic map template is a sort of schema imposing constraints on topic
map objects with TM syntax. The standard does not provide any means by
which the designer of the TM template can put constraints onto the topic map
itself. Standardisation of such constraints is currently in progress [14].

Displaying lists of indexes which the user can navigate easily is the standard 
way of TM visualization. As this approach does not scale well for larger maps, 
augmenting navigation with some sort of searching facility is recommended. 
Other visualization techniques such as hyperbolic trees [15], cone trees, and 
hypergraph views (Fig. 2) can be used for visualization and navigation of topic 
maps. They display TMs as a graph, with the topics and occurrences as nodes 
and the associations as arcs. The drawback of such ‘advanced’ techniques is that 
users are usually unfamiliar with them.




Fig. 2. Hypergraph visualization with TMNav. 



There are several proposed query languages for topic maps. None of them
is part of the standard, and there are inconsistencies among different TM engines.
Two of the most prominent proposals are: 

— TMQL (Topic Maps Query Language [9]), with SQL-like syntax, provides
both for querying and modifying topic maps (select, insert, delete, update). 

— Tolog, inspired by the logic programming language Prolog, supports require- 
ments for TMQL with clearer and simpler syntax. 
The introduction to the TM standard presented in this paper does not cover
all the details of the technology. Interested readers can find an exhaustive de- 
scription in [15], which contains a detailed introduction with numerous examples, 
and [16]. 

5 Topic Maps and RDF 

The W3C promotes the Resource Description Framework (RDF) [10] as another 
framework for expressing metadata. RDF is a W3C standard envisioned to be 
a foundational layer of the Semantic Web. 

The fundamental notion in the RDF data model is a statement, which is a 
triple composed of a resource, property, and value. The RDF Schema (RDFS) [4] 
is a W3C working draft aimed at defining a description language for vocabularies 
in RDF. More expressive RDFS models have been proposed recently [23]. 
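
To make the statement model concrete, the following Python sketch stores a few statements as (resource, property, value) triples and looks up the values of one property. It is a minimal illustration that does not use any RDF library; the paper URI and the title are invented for this example, and the two property URIs are taken from the Dublin Core element set.

```python
from typing import List, NamedTuple

# Dublin Core element URIs used as property names
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"


class Statement(NamedTuple):
    """One RDF-style statement: a resource, a property, and a value."""
    resource: str
    prop: str
    value: str


def values_of(statements: List[Statement], resource: str, prop: str) -> List[str]:
    """Return all values stated for the given resource and property."""
    return [s.value for s in statements
            if s.resource == resource and s.prop == prop]


PAPER = "http://example.org/papers/1"  # hypothetical resource URI
statements = [
    Statement(PAPER, DC_TITLE, "A hypothetical paper"),
    Statement(PAPER, DC_CREATOR, "Plata-Przechlewska, Wanda"),
]
print(values_of(statements, PAPER, DC_CREATOR))
```

Note that even the name of the resource is expressed through an external vocabulary here, since the triple model itself prescribes no properties.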

One key difference between RDF and topic maps is that topic maps are 
modelled on a concept-centric view of the world. For example, in RDF there 
are no ‘predefined’ properties, so to assign a name to a resource one has to use 
another standard (such as Dublin Core), something that is not necessary with 
topic maps. The notion of scope is also absent from RDF.

The RDF and Topic Maps standards are similar in many respects [7]. Both 
offer simple yet powerful means of expressing concepts and relationships. 

6 Building Topic Maps for the GUST Bibliographic Database

Similar to writing a good index for a book, creating a good topic map is carried 
out by combining manual labour with the help of some software applications. It 
is usually a two-stage task, beginning with the modelling phase of building the 
‘upper-part’ of the map, i.e., the hierarchy of topic and association types (the 
schema of the map) and then populating the map with instances of topic types, 
their associations and occurrences. 

Approaches for developing a topic map out of a pool of information resources 
include [2]: 

— using standard vocabularies and taxonomies (e.g., www.dmoz.org) as the
initial source of topic types.

— generating TMs from structured databases or documents, with
topic types and association types derived from the schema of the
database/document.

— extracting topics and topic associations from pools of unstructured or
loosely structured documents using NLP (Natural Language Processing)
software combined with manual labour.

The first approach is concerned with the modelling phase of topic map creation, 
while the third one deals with populating the map. 
Following the above guidelines, the structure of the BibTeX records was an
obvious candidate to start with in modelling our map of GUST articles. It
provides a basic initial set of topics including: author, paper, keyword, and the
following association types: author-paper, paper-keyword and author-keyword.
Abstracts (if present in BibTeX databases) can be considered as occurrences of
the topic paper. The publication date and language can be used as scopes for
easy navigation, using them as constraints.
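
How the three association types follow from a bibliographic record can be sketched as below. This is a hedged illustration, not the Perl/XSLT tool chain used for the GUST map: the record layout (a dict with "key", "authors" and "keywords") and the sample entry are assumptions made for this example.

```python
from itertools import product


def records_to_associations(records):
    """Derive author-paper, paper-keyword and author-keyword associations."""
    associations = set()
    for record in records:
        paper = record["key"]
        for author in record["authors"]:
            associations.add(("author-paper", author, paper))
        for keyword in record["keywords"]:
            associations.add(("paper-keyword", paper, keyword))
        # every author of a paper is linked to every keyword of that paper
        for author, keyword in product(record["authors"], record["keywords"]):
            associations.add(("author-keyword", author, keyword))
    return associations


# A made-up record standing in for one parsed BibTeX entry
records = [{
    "key": "paper-1",
    "authors": ["Author, A.", "Author, B."],
    "keywords": ["fonts", "XML"],
}]
associations = records_to_associations(records)
print(len(associations))
```

For a record with two authors and two keywords this yields two author-paper, two paper-keyword and four author-keyword associations.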

Other TAOs (topics, associations, and occurrences [16]) to consider are:
author home pages (occurrence type), applications described in the paper
(association type), papers referenced (association type). This information is absent
from BibTeX files but, at least theoretically, can be automatically extracted from
the source files of papers.

We started by analyzing the data at our disposal, i.e., TeX and BibTeX source
files. Unfortunately, in the case of the GUST bulletin the BibTeX database was
not maintained. This apparent oversight was rectified with simple Perl scripts
and a few days of manual labour. The bibliographic database was created and
saved in an XML-compatible file.

TeX documents are typically visually tagged and lack information-oriented
markup. The only elements marked up consistently and unambiguously in the
case of the GUST bulletin are the paper titles and authors’ names. Authors’
home pages were rarely present, while email addresses were available but not
particularly useful for our purposes. Neither abstracts nor keyword lists had
been required and as a consequence were absent from the majority of the papers.
Similarly, any consistent scheme of marking bibliographies (or attaching .bib
files) was lacking, so there was no easy way to define the ‘related to’ association
between papers.

The benefit derived from keywords is much greater if they are applied
consistently according to some fixed classification; otherwise, the set of keywords
usually consists of many random terms which are nearly useless. Since we didn’t
want to define yet another ‘standard’ in this area, we would have liked to
adopt an existing one. The following sources were considered: the TeX entry at
dmoz.org, Graham Williams’ catalogue, collections of BibTeX files and .tpm
files [20].

The accuracy of the TeX taxonomy subtree at dmoz.org was somewhat
questionable, and we quickly rejected the idea of using it. Williams’ catalogue
of TeX resources does not include any information except the location of the
resource in the structure of CTAN. As for BibTeX files, it appeared only MAPS
and TUGboat were complete and up to date, but only the latter contains
keywords. Unfortunately, they do not comply with any consistent scheme. Due
to the lack of any existing standard, the keywords were added manually on a



® We reused the XML schema developed for the MAPS bulletin (http://www.ntg. 
org. In/maps/). 

http://www.ctan.org/tex-archive/help/Catalogue

Cahiers GUTenberg was not found, but the impressive portal of Association
GUTenberg indicates appropriate metadata are maintained, but not published.
commonsense basis, with the intention of being ‘in sync’ with the most frequent
terms used in MAPS.

Finally the following TAOs were defined (the language of the publication 
was considered to be the only scope): 

— topic types: author, paper, and keyword;

— association types: author-paper, paper-keyword, and author-keyword;

— occurrence types: papers and abstracts. 

The schema of the map was prepared manually and then the rest of the map
was generated from an intermediate XML file with an XSLT stylesheet [8, 13]. The
resulting map consists of 454 topics, 1029 associations, and 999 occurrences. A
fragment of the map rendered in a web browser with Ontopia Omnigator (a
no-cost but closed-source application, http://www.ontopia.net/download/) is
shown in Fig. 3.
Fig. 3. A fragment of GUST topic map rendered with Omnigator.



Omnigator shows the map as a list of links arranged in panels. Initially only
a list of subjects (index of topic types) is displayed. When a link for a topic is
clicked on, the topic is displayed with all the information about its characteristics
(names, associations, occurrences). In Fig. 3, an example page for author Kees
van der Laan is shown. The right panel contains all the relevant resources while
the lower left has all the related topics, i.e., papers written by Kees and other
subjects covered. The user can easily browse the papers authored by him and
switch to pages on some other interesting subject. The panel with resources
contains information on the resource type allowing fast access to the required
data.

There are 814 bibliographic entries in the MAPS base and 895 different keywords. The
most popular keywords in the MAPS BibTeX file are: LaTeX-51, NTG-42, plain TeX-37,
PostScript-28, ConTeXt, TeX-NL, METAFONT, SGML, and so forth. There are a
small number of inconsistent cases (special commands vs. specials, or configuration
vs. configuring) and fine-grained keywords (Portland, Poland, Bachotek, USSR!).

Similar functionality can be obtained with the freely available TM4Nav or 
even by using a simple XSLT stylesheet [13]. 

7 Integrating Other TeX Resources

So far there is nothing in TMs that cannot be obtained using other technologies. 
The same or better functionality can be achieved with any database management 
system (DMS). But integrating TeX resources on a global scale needs flexibility,
which traditional RDBMS-based DMS applications lack. For example, topic 
maps can be extended easily through merging separate maps into one, while 
DMS-based extensions usually require some prior agreement between the parties 
(e.g., LUGs), schema redefinitions, and more. 

To verify this flexibility in practice, we extended the GUST map with the
MAPS and TUB BibTeX databases. For easy interoperability in a multi-language
environment, the upper half of the map was transferred to a separate file.
With the use of scope, the design of multi-language topic types was easy, for
example:

<topic id="english">
  <subjectIdentity>
    <subjectIndicatorRef
      xlink:href="http://www.topicmaps.org/xtm/1.0/language.xtm#en"/>
  </subjectIdentity>
  <baseName>
    <baseNameString>EN</baseNameString>
  </baseName>
</topic>

<topic id="author">
  <baseName><scope>
    <topicRef xlink:href="#english"/></scope>
    <baseNameString>author</baseNameString>
  </baseName>
  <baseName><scope>
    <topicRef xlink:href="#polish"/></scope>
    <baseNameString>autor</baseNameString>
  </baseName>
</topic>
Other topic types were designed similarly. Scopes for other languages can easily
be added. 
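
How a display name is selected by scope can be tried out with a few lines of Python and the standard xml.etree.ElementTree module. This is a minimal sketch under simplifying assumptions: the wrapper element and the helper name_in_scope are hypothetical, and the XTM default namespace is left out so that the element names match the listing above; only the XLink namespace is declared.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced attribute names as "{namespace}local"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

XTM = """<topicMap xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="author">
    <baseName>
      <scope><topicRef xlink:href="#english"/></scope>
      <baseNameString>author</baseNameString>
    </baseName>
    <baseName>
      <scope><topicRef xlink:href="#polish"/></scope>
      <baseNameString>autor</baseNameString>
    </baseName>
  </topic>
</topicMap>"""


def name_in_scope(topic, scope_id):
    """Return the first base name of the topic valid in the given scope."""
    for base_name in topic.findall("baseName"):
        refs = [ref.get(XLINK_HREF)
                for ref in base_name.findall("scope/topicRef")]
        if "#" + scope_id in refs:
            return base_name.findtext("baseNameString")
    return None


topic = ET.fromstring(XTM).find("topic[@id='author']")
print(name_in_scope(topic, "polish"))
```

Adding a further language then only requires one more scoped baseName element; the lookup code stays unchanged.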

The ‘lower part’ of the map was generated from (cleaned) BibTeX records
with bibtex2xml.py (http://bibtexml.sf.net) and then transformed to
MAPS XML with an XSLT stylesheet. Keywords were added to TUB entries
using a very crude procedure.
Fig. 4. Topic map fragment from Fig. 3 scoped to Polish language.



Simple name-based merging of all three maps results in over 25,000 TAOs
(≈ 1000 authors, more than 2000 papers). Some of the subjects were represented
with multiple topics. As an example the Grand Wizard was represented as the
following four distinct topics: ‘Knuth, Don’, ‘Knuth, Donald’, ‘Knuth, Donald
E.’, ‘Knuth., Donald E.’.

As identity-based merging is regarded as more robust, some identifiers have
to be devised first. Establishing a PSI for every author seemed overly
ambitious. Instead, a dummy subject identifier was chosen, such as:
http://tug.org/authors#initials-surname. This can still produce multiple topics
for the same subject, but now we can eliminate unwanted duplicates by defining
an additional map consisting solely of topics like the following [18]:

Acronyms, such as LaTeX, METAFONT, or XML, present in the title were used as
keywords.

First name variants, abbreviations and middle names cause problems in many more
cases.
<topic id="de-knuth">
  <subjectIdentity>
    <subjectIndicatorRef
      xlink:href="http://tug.org/authors#d-knuth"/>
    <subjectIndicatorRef
      xlink:href="http://tug.org/authors#d-e-knuth"/>
  </subjectIdentity>
</topic>

Merging this map with the ‘base’ map(s) will result in a map free of unwanted 
duplicated topics with all variant names preserved. 
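
The identifier scheme can be illustrated with a short Python sketch. The function below is an assumption about how an identifier of the form initials-surname might be derived from a ‘Surname, Given names’ string; it is not the procedure actually used for the maps, and it makes no attempt to transliterate non-ASCII letters.

```python
import re

BASE = "http://tug.org/authors#"


def author_identifier(name):
    """Build an initials-surname identifier from 'Surname, Given names'."""
    surname, _, given = name.partition(",")
    # keep letters only, so that a stray period ('Knuth.') does no harm
    surname = "".join(ch for ch in surname.lower() if ch.isalpha())
    initials = [part[0].lower()
                for part in re.split(r"[\s.]+", given) if part]
    return BASE + "-".join(initials + [surname])


variants = ["Knuth, Don", "Knuth, Donald", "Knuth, Donald E.", "Knuth., Donald E."]
identifiers = sorted({author_identifier(name) for name in variants})
print(identifiers)
```

The four name variants collapse to two identifiers, ending in d-knuth and d-e-knuth, which is why an additional topic listing both subject indicators is still needed to obtain a single topic.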

For further extensions, we plan to incorporate CTAN resources. For that
purpose, Williams’ catalogue and/or the TPM files from the TeX Live project can be
used. As the catalogue contains author names, it would be possible, for example,
to enrich the map with the author-application association. Further enrichment
will result if we can link applications with documents describing them. However,
some robust classification scheme of TeX resources should be devised first.

8 Topic Map Tools 

As with any other XML-based technology, topic maps can be developed with 
any text editor and processed with many XML tools. However, for larger-scale 
projects specialized software is needed. There are a few tools supporting topic 
map technology, developed both commercially and as Open Source projects. We 
have considered both Ontopia Omnigator (mentioned in the previous section) 
and TM4J (free software). 

TM4J (http://tm4j.org) is a suite of Java packages which provide
interfaces and default implementations for the import, manipulation and export
of XML Topic Maps. Features of the TM4J engine include an object model
which supports the XTM specification with the ability to store a topic map in an
object-oriented or relational database, and an implementation of the tolog query
language.

Based on TM4J a few projects are in progress: TMNav for intuitive
navigation and editing of topic maps, and TMBrowse for publishing maps as a set of
HTML pages (similarly to Omnigator).

These projects are in early stages and our experience with TMBrowse
indicates that the current version frequently crashes with bigger maps and is
significantly slower than Omnigator. There were problems with tolog queries as well.

As all these projects are actively maintained, progress may be expected in
the near future. 

9 Summary 

Topic maps are an interesting new technology which can be used to describe the
relation between TeX resources. The main problem is topic map visualization.
Available tools are in many cases unstable and non-scalable, but we can expect
improvement.

The system presented here can certainly be improved. It is planned to extend 
it with the content of Williams’ catalogue. The maps developed in the project are 
available from http://gnu.univ.gda.pl/~tomasz/tm/. At the same address, 
the interested reader can find links to many resources on topic maps. 
Abstract. While TeX [4] provides high quality typesetting features,
its usability suffers due to its macro-based command language. Many
tools have been developed over the years simplifying and extending the
TeX interface, such as LaTeX [5], LaTeX3 [6], pdfTeX [2], and NTS [8].
Front-ends such as TeXmacs [10] follow the visual/graphical approach to
facilitate the coding of documents. The system introduced in this paper,
however, is radical in its targeting of optimized code appearance.

The primary goal of §aferTeX is to make the typesetting source code as
close as possible to human-readable text, to which we have been
accustomed over the last few centuries. Using indentation, empty lines and a
few triggers allows one to express interruption, scope, listed items, etc.
A minimized frame of ‘paradigms’ spans a space of possible typesetting
commands. Characters such as ‘$’ do not have to be backslashed.
Transitions from one type of text to another are automatically detected,
with the effect that environments do not have to be bracketed explicitly.
The following paper introduces the programming language §aferTeX as
a user interface to the TeX typesetting engine. It is shown how the
development of a language with reduced redundancy increases the beauty
of code appearance.



1 Introduction 

The original role of an author in the document production process is to act as an 
information source. To optimize the flow of information, the user has to be freed 
from tasks such as text layout and document design. The user should be able to 
delegate the implementation of visual document features and styles to another 
entity. With this aim in mind, the traditional relationship between an author 
and his typesetter before the electronic age can be considered the optimal case. 
Modern technology has increased the speed and reduced the cost of document 
processing. However, the border between information specification and document 
design has blurred or even vanished. 

In typesetting engines with a graphical user interface, an editor often takes 
full control over page breaks, font sizes, paragraph indentation, references and so 
on. Script-oriented engines such as TeX take care of most typesetting tasks and
provide high quality document design. However, quite often the task to produce 
a document requires detailed insight into the underlying philosophy. 




Fig. 1. Neo-traditional typesetting. 



§aferTeX tries to get back to the basics, as depicted in Figure 1. Like the
traditional writer, a user shall specify information as redundancy-free as possible 
with a minimum of commands that are alien to him. Layout, features, and styles 
shall be implemented according to predefined standards with a minimum of 
specification by the user. 

To the user, the engine provides a simple interface, only requiring plain text, 
tables and figures. A second interface allows a human expert to adapt the engine 
to local requirements of style and output. Ideally, the added features in the 
second interface do not appear to the user, but are activated from the context. 
Then, the user can concentrate on the core information he wants to produce, 
and not be distracted by secondary problems of formatting. 

The abovementioned ideal configuration of engine, user, and expert can 
hardly be achieved with present automated text formatting systems. While 
relying on TeX as a typesetting engine, §aferTeX tries to progress towards a
minimal-redundancy programming language that is at the same time intuitive 
to the human writer. 



2 The §aferTeX Engine

As shown in Figure 2, the §aferTeX engine is based on a three-phase compilation,
namely: lexical analysis, parsing and code generation. Along with the usual
advantages of such modularization, this structure allows us to describe the engine
in a very formal manner. In this early phase of the project, it further facilitates
adding new features to the language. Using interfacing tools such as SWIG [1]
and .NET [9], it should be possible in the future to pass the generated parse tree
to different programming languages. Such an approach would open a whole new
world to typesetters and document designers. Plug-ins for §aferTeX could then
be designed in the person’s favorite programming language (C++, Java, Python,
C#, anything). Currently, automated document production mainly happens by
preprocessing LaTeX code. Using a parse tree, however, gives access to document
contents in a structured manner, i.e., through dedicated data structures such as
objects of type section, item group, and so on.

Fig. 2. The §aferTeX compilation process.

GNU flex [7] (a free software package) is used to create the lexical analyzer.
It was, though, necessary to deviate from the traditional idea of a lexical analyzer 
as a pure finite state automaton. A wrapper around flex implements inheritance 
between modes (start conditions). Additionally, the lexical analyzer produces 
implicit tokens and deals with indentation as a scope delimiter. 
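
The idea of implicit tokens derived from indentation can be sketched in Python. The following is an illustrative sketch only and not the flex-based analyzer of the engine: the token names INDENT, DEDENT and LINE are assumptions, only spaces count as indentation, and blank lines are simply skipped.

```python
def indentation_tokens(lines):
    """Turn changes of indentation into implicit scope tokens."""
    stack = [0]  # indentation widths of the currently open scopes
    for raw in lines:
        if not raw.strip():
            continue  # simplification: blank lines carry no meaning here
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT",)
        yield ("LINE", raw.strip())
    while len(stack) > 1:  # close the scopes still open at end of input
        stack.pop()
        yield ("DEDENT",)


source = ["a", "  b", "  c", "d"]
tokens = list(indentation_tokens(source))
print(tokens)
```

A parser that receives these tokens can treat INDENT and DEDENT exactly like opening and closing brackets, so the grammar itself stays context-free.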

The parser is developed using the Lemon parser generator [3] (also free
software). Using such a program, the §aferTeX language can be described with a
context-free grammar. The result of the parser is a parse tree, which is currently
processed in C++. The final product is a LaTeX file that is currently fed into
the LaTeX engine. A detailed discussion of the engine is not the intention of this
paper, though. The present text focuses on the language itself.



3 Means for Beauty 

§aferTeX tries to optimize code appearance. The author identifies three basic
means by which this can be achieved: 

1. The first means is intuitive treatment of characters. For example, ‘%’ and
‘$’ are used as normal characters and do not function as commands, as they do
in LaTeX.

2. The second is to use indentation as the scope delimiter. This is reminiscent of 
the Python programming language. It allows the user to reduce brackets and 
enforces proper placement of scopes. For table environments, this principle 
is extended so that the column positions can be used as cell delimiters. 

3. The third principle is automatic environment detection. If an item appears, 
then the ‘itemize’ environment is automatically assumed. This reduces re- 
dundancy, and makes the source file much more readable. 

Applying these principles leads to the eight rules of §aferTeX as they are
explained at the end (section 5). We now discuss them in more detail.
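
The third principle, automatic environment detection, can be illustrated with a small Python sketch. It is not the §aferTeX implementation: the item trigger "- " is a hypothetical choice made for this example, and the function only handles a flat list of items.

```python
def wrap_items(lines, marker="- "):
    """Open and close an itemize environment around runs of item lines."""
    out = []
    in_list = False
    for line in lines:
        is_item = line.startswith(marker)
        if is_item and not in_list:
            out.append(r"\begin{itemize}")   # first item: open implicitly
        elif not is_item and in_list:
            out.append(r"\end{itemize}")     # first non-item: close
        in_list = is_item
        out.append(r"\item " + line[len(marker):] if is_item else line)
    if in_list:
        out.append(r"\end{itemize}")         # list runs to the end of input
    return out


source = ["Intro", "- one", "- two", "Outro"]
result = wrap_items(source)
print("\n".join(result))
```

The author writes only the item lines; the bracketing commands that LaTeX requires are generated from the transitions between ordinary text and items.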
3.1 Intuitive Treatment of Characters

In the design of a typesetting language, the user has to be given the ability to 
enter both normal text and commands specifying document structure and non- 
text content. This can be achieved by defining functions, i.e., using character 
sequences as triggers for a specific functionality. This happens when we define, 
say, sin(x) as a function computing the sine of x. For a typesetter this is not a 
viable option, since the character chain can be easily confused with normal text. 
As a result, one would have to ‘bracket’ normal text or ‘backslash’ functions. 
Another solution is to use extra characters. This was the method Donald Knuth
chose when he designed TeX [4]. The first solution is still intuitive to most users.
The second, however, is rather confusing, implying that ‘%’, ‘$’ and ‘_’ have a 
meaning different from what one sees in the file. 

Historically, at the time TeX was designed, keyboards had a very restricted
number of characters. Moreover, several other factors may have contributed to
the design choices made: ASCII was the standard text encoding in Knuth’s
cultural context, data storage was expensive, and advanced programming
languages were lacking. Although the documents produced still equal and even
outclass those of most commercial systems of our days, the input language, it
must be admitted, is rather cryptic.

The first step towards readability of code is to declare a maximum number
of characters as ‘normal’. In SaferTeX, the only character that is not considered
normal is the backslash. All other characters, such as ‘%’, ‘$’ and ‘_’, appear in
the text as they are. Special characters only act abnormally if they appear twice
without whitespace in between. These tokens fall into the category of alien things,
meaning that they look strange and thus are not expected to appear verbatim
in the output.

Table 1 compares LaTeX code to SaferTeX code, showing the improvement
with respect to code appearance. The advantages may seem minor. Consider,
however, the task of learning the difference between the characters that can
be typed normally and others that have to be backslashed or bracketed. The
abovementioned simplification already removes the chance of subtle errors
appearing when LaTeX code is compiled. The subsequent sections show how the
code appearance and the ease of text input can be further improved.

3.2 Scope by Indentation 

We have discussed how commands are best defined in a typesetting engine. 
One way to organize information is to create specific regions, called scopes or 
environments. Most programming languages use explicit delimiters for scopes 
without giving any special meaning to white space of any kind. This implies 
that the delimiters must be visible. C++, for example, uses curly braces, while
LaTeX uses \begin{...} ... \end{...} constructs to determine scope. This
approach allows one to place the scopes very flexibly. However, it pollutes the 
text with symbols not directly related to the information being described. The 
more scopes that are used, and the deeper they are nested, the more the source 
text loses readability. 
Table 1. Comparison of treatment of special characters in LaTeX and SaferTeX.

LaTeX:    According to Balmun \& Refish
          $<$www.b-and-r.org$>$ a conversion of module
          \#5, namely ‘propulsion\_control,’ into a metric
          system increases code safety up to 98.7\% at
          cost of \~\ \$17,500.

SaferTeX: According to Balmun & Refish <www.b-and-r.org>
          a conversion of module #5, namely
          ‘propulsion_control,’ into a metric system
          increases code safety up to 98.7% at cost of
          ~ $17,500.



Another approach is scoping by indentation. A scope of a certain indentation
envelops all subsequent lines and scopes as long as they have more indentation.
Figure 3 shows an example of scope by indentation. LaTeX’s redundancy-rich
delimiters add nothing but visual noise for the reader of the file. SaferTeX,
however, uses a single backslashed command \quote in order to open a quote
domain. The scope of the quote is then simply closed by the lesser indentation
of the subsequent sentence.



Einstein clearly stated his disbelief in the
boundedness of the human spirit as becomes
clear through his sayings:

    \quote The difference between genius and
           stupidity is that genius has its limits.

           Only two things are infinite, the
           universe and human stupidity, and I’m
           not sure about the former.

Similar reports have been heard from Frank
Zappa and others.

Fig. 3. Scope by indentation. 



This simple example was chosen to display the principle. It is easy to imagine
that for more deeply nested scopes (e.g., picture in minipage in center in
figure), LaTeX code converges to unreadability, while SaferTeX code still allows
one to get a quick overview of the document structure. Scope by indentation
has proven to be a very convenient and elegant tool.

An extension of this concept is using columns as cell delimiters in a table
scope. The implementation of tables in SaferTeX allows the source to omit many
‘parboxes’ and explicit ‘&’-cell delimiters. To begin with, a row is delimited by
an empty line. This means that line contents are glued together as long as only
one line break separates them. The cell content, though, is collected using the
position of the cell markers ‘&&’ and ‘||’. Additionally, a dedicated symbol glues
two cells together. This makes cumbersome declarations with \multicolumn and
\parbox unnecessary. Figure 4 shows an example of a SaferTeX table definition.



\table Food suppliers, prices and amounts.

    Product       || Price/kg && Supplier           && kg  || Total Price

    Sugar         || $0.25    && Jackie O'Neil      && 34  || $8.50

    Yellow Swiss  || $12.2    && United Independent && 100 || $1220.00
    Cheese                       Farmers of
                                 Switzerland

    Green Pepper  || $25.0    && Anonymous          && 2   || $50.00
    Genuine                      Indian Tribes
    Mexican

    Sum                                                    && $1278.50



Fig. 4. Example of writing a table: identifying cell borders by column. 
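
The idea of identifying cell borders by column can be illustrated with a small sketch. It is not the SaferTeX implementation; it assumes, for illustration only, that the first line of a row carries the markers ‘&&’ and ‘||’, that their column positions fix the cell borders, and that continuation lines of the same row are cut at the same columns. The name split_row is hypothetical.

```python
import re

# Cell markers as they appear in the table example: '&&' and '||'.
MARKER = re.compile(r"&&|\|\|")


def split_row(lines):
    """Collect the cells of one table row (a block of lines without empty lines).

    Assumption of this sketch: the markers on the first line define the
    column borders; every further line is sliced at the same columns and
    its pieces are glued to the corresponding cells.
    """
    first = lines[0]
    cuts = [m.end() for m in MARKER.finditer(first)]
    # Blank out the markers so that they do not end up in the cell text.
    rows = [MARKER.sub("  ", first)] + list(lines[1:])
    starts = [0] + cuts
    ends = cuts + [None]
    cells = []
    for start, end in zip(starts, ends):
        pieces = [row[start:end].strip() for row in rows]
        cells.append(" ".join(p for p in pieces if p))
    return cells


if __name__ == "__main__":
    head = "Yellow Swiss || $12.2 && United Independent && 100 || $1220.00"
    col = head.index("United")  # column where the supplier cell starts
    row = [head, "Cheese".ljust(col) + "Farmers of", " " * col + "Switzerland"]
    for cell in split_row(row):
        print(repr(cell))
```

Slicing by column is what makes \parbox-like declarations unnecessary in this picture: a multi-line cell is simply whatever text sits between two borders on consecutive lines.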



4 Implicit Environment Detection 

A basic means of improving convenience of programming is reducing redundancy.
In LaTeX, for example, the environment declarations are sometimes unnecessary.
To declare a list of items, one has to specify something like

\begin{itemize} 

\item This is the first item and 
\item this one is the second. 

\end{itemize} 

Considering the information content, the occurrence of the \item should
be enough to know that an itemize environment has started. Using our second
paradigm, ‘scope by indentation’, the closing of the environment could be
detected by the first text block that has less indentation than the item itself. The
\begin and \end statements are therefore redundant. In SaferTeX, the token
‘--’ (two dashes) is used to mark an item. Thus, in SaferTeX, the item list above
simply looks like:

    -- This is the first item and

    -- this one is the second.

As implied previously, this paradigm’s power really unfolds in combination
with scope by indentation. Subsequent paragraphs simply need to be indented
more than the text block to which they belong. Nested item groups are specified
by higher levels of indentation, as seen in figure 5.



Items provide a good means to

    -- structure information

    -- emphasize important points. There are
       three basic ways to do this:

       [[Numbers]]: Enumerations are good when
                    there is a sequential order
                    in the information being
                    presented.

       [[Descriptions]]: Descriptions are
                         suitable if keywords
                         or key phrases are
                         placeholders for more
                         specific information.

       [[Bullets]]: Normal items indicate that
                    the presented set of
                    information does not
                    define any prioritization.

    -- classify basic categories

There may be other things to consider of
which the author is currently unaware.

Fig. 5. Example code showing ‘scope by indentation’. 



Some important points from the example: 

- The appearance of a ‘--’ at the beginning of a line tells SaferTeX that there
is an item and that an implicit token ‘list begin’ has to be created before the
token ‘item start’ is sent. The next ‘--’ signals the start of the next item.

- The ‘[[’-symbol appears at the beginning of the line. It indicates a descriptor
item. Since it has a higher indentation than the ‘--’ items, it is identified
as a nested list. Therefore, an implicit token ‘list begin’ has to be created
again.

- The final sentence, having less indentation than anything before, closes all
lists, i.e., it produces implicit ‘list end’ tokens for all lists that are to be
closed. Thus, the parser and code generator are able to produce environment
commands corresponding to the given scopes.
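
The generation of implicit tokens described in these points can be sketched as follows. This is only an illustration, not the actual SaferTeX parser: the token names follow the text, the function name implicit_list_tokens is hypothetical, and the sketch assumes that a plain text line closes every list whose item marker is indented at least as far as that line.

```python
import re

# Item markers from the text: '--' and '++' (bullets), '##' (enumeration),
# '[[' (description). Leading whitespace determines the indentation.
ITEM = re.compile(r"\s*(--|\+\+|##|\[\[)")


def implicit_list_tokens(lines):
    """Return the token stream with implicit 'list begin' / 'list end' tokens."""
    tokens = []
    open_lists = []  # indentation of the item markers of all open lists
    for line in lines:
        if not line.strip():
            continue  # empty lines only separate paragraphs here
        indent = len(line) - len(line.lstrip())
        if ITEM.match(line):
            # An item with less indentation closes the deeper lists first.
            while open_lists and open_lists[-1] > indent:
                open_lists.pop()
                tokens.append("list end")
            # A deeper (or first) item implicitly opens a new list.
            if not open_lists or open_lists[-1] < indent:
                open_lists.append(indent)
                tokens.append("list begin")
            tokens.append("item start")
        else:
            # Assumption: text at or left of an item marker leaves that list.
            while open_lists and open_lists[-1] >= indent:
                open_lists.pop()
                tokens.append("list end")
            tokens.append("text")
    tokens.extend("list end" for _ in open_lists)
    return tokens
```

A stack of indentation levels is sufficient because, under scope by indentation, scopes can only be closed in the reverse order in which they were opened.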

Now that the fundamental ideas to improve programming convenience for a
typesetting system have been discussed, we turn to defining a best set of rules
for expressions that implement these ideas.



5 The Eight Rules of SaferTeX

Rules for command design shall be consistent with the paradigms of intuitive
treatment of characters, scope by indentation, and automatic environment
detection. The following set of rules was designed to meet these goals for SaferTeX
while striving for an intuitive code appearance:

[1] Every character and every symbol in the code appears in the final output 
as in the source document, except for Alien things. 

[2] Alien things look alien.

In TeX, plain characters such as ‘$’, ‘%’ and ‘_’ do not appear in the document
as typed. The fact that they look natural but trigger some TeX-specific behavior
is prone to confuse the layman. In SaferTeX, they appear as typed on the screen.
Alien things can be identified by their look. The next four rules define the ‘alien
look’:

[3] Any word starting with a single backslash \. Examples are \figure and 
\table. 

[4] Any non-letter character that appears twice or more, such as ‘##’ (this 
triggers the start of an enumeration item at the beginning of the line). 

[5] Parentheses (at the beginning of a line) that only contain asterisks or
whitespace. Sequences such as ‘(*)’, ‘( )’, ‘(***)’ indicate sections and
subsections.

[6] The very first paragraph of the file. It is interpreted as the title of the
document.

Except for the first case, alien things do not interfere with readability. In fact,
the double minus ‘--’ for items and the ‘(*)’ for sections are used naturally in
many ASCII files. Internally, alien things are translated into commands for the
typesetting engine, but the user does not need to know.

The last two issues are separation of the text stream and identification of 
scope of an environment: 

[7] Termination of paragraphs, interruptions of the text flow, etc., are indicated
by an empty line.

[8] The scope of an environment, table cells, etc. is determined by its
indentation. A line with less indentation closes all scopes of higher indentation.

These are the eight rules of SaferTeX that enable one to operate the typesetter.
They are defined as ‘rules’ but, in fact, they do not go much beyond common
organization of text files.
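
Rules [3] to [5] lend themselves to a compact illustration with regular expressions. The following is a minimal sketch under stated assumptions, not the SaferTeX scanner: the function name find_aliens and the labels are hypothetical, and digits are excluded from rule [4] here (the rule as stated says ‘non-letter’, but amounts such as $1220.00 in the examples are evidently meant to pass through unchanged).

```python
import re

# Rule [5]: parentheses at the beginning of a line holding only '*' or blanks.
SECTION = re.compile(r"\s*\([* ]+\)")

INLINE = [
    # Rule [3]: a word starting with a single backslash.
    ("command", re.compile(r"\\[A-Za-z]+")),
    # Rule [4]: a non-letter character repeated without whitespace in between
    # (digits, whitespace and the backslash are excluded in this sketch).
    ("doubled symbol", re.compile(r"([^\sA-Za-z0-9\\])\1+")),
]


def find_aliens(line):
    """Return (kind, text) pairs for everything on the line that looks alien."""
    found = []
    start = 0
    section = SECTION.match(line)
    if section:
        found.append((section.start(), "section", section.group().strip()))
        start = section.end()  # do not rescan the asterisks of the marker
    for kind, pattern in INLINE:
        for hit in pattern.finditer(line, start):
            found.append((hit.start(), kind, hit.group()))
    found.sort()
    return [(kind, text) for _, kind, text in found]
```

With these patterns, a line such as the SaferTeX version of Table 1 contains no alien things at all, which is exactly the point of rule [1].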

Details about

The Elves and The Shoemaker

\Author Original:
        Brothers Jakob & Wilhelm Grimm
        Somewhere in Germany

( )  Abstract                         .... abstract.st
(*)  Nocturne shoe productions        .... strange.st
(**) Living in confusion              .... confusion.st
(**) Women make trouble               .... trouble.st
(*)  Midnight observations            .... midnight.st
(**) Elves in the cold                .... freezing-elves.st
(*)  New era for elves: luxury        .... luxury.st
(*)  Elves leave their job undone     .... spoiled-elves.st



Fig. 6. Example input ‘main.st’. 



6 Commands 

This section gives a brief overview of the commands that are currently
implemented. In this early stage of development, the system’s structure and language
design have been in the foreground, in order to build the framework for a more
powerful typesetting engine. In the current version of SaferTeX, the following
commands are implemented:

--, ++ start a bullet item. The two can be used interchangeably to distinguish
different levels of nested item groups.

## starts an enumeration item.

[[ ]] bracket the beginning of a description item.

\table opens a table environment. It is followed by a caption and the table
body as described in section 3.2.

\figure opens a figure environment. The text following this command is
interpreted as the caption. Then file names of images are to be listed. Images
that are to be shown side by side are separated by ‘&&’. Vertically adjacent
images are separated by empty lines.

\quote opens a quote environment.

(*) starts a section. The number of asterisks indicates the level of the section.

( ) starts a section without a section number. The number of blanks indicates
the section level.

.... includes a file (more than four dots are equivalent to four). The next
non-whitespace character sequence is taken as the filename to be included.

\author specifies information about the author of the document.

\figure ::fig:plots:: Performance a) productivity of shoemaker, b) gain.
    ferry-tales/prod.eps && ferry-tales/capital.eps

Reviewing the plots of shoes produced (figure --<fig:plots>), the shoemaker
realized an instantaneous increase during the night period. He only could
think of two possible reasons:

    ## He was sleepworking. Since he even used to work few when awake this
       assumption was quickly refuted.

    ## Elves must have come over night and did some charity work.

He further based his theory on the influence of the tanning material used.
In fact, there were differences in the number of shoes produced depending
on acid number and pH value (see table --<tab:tan-mat>).

\table ::tab:tan-mat:: Influence of tanning materials on shoe production.

    Tanning Mat. && pH value  && acid number && shoes prod.

    European     && 3.4 - 3.7 && 30 - 40     && 32          @@

    Indian       && 2.0 - 2.1 && 31 - 45     && 35          @@

    African      && 4.5 - 4.6 && 33 - 37     && 36          @@

    Australian   && 3.0 - 7.0 && 27 - 45     && 15          @@

Resourcing several leathers from Indian & african suppliers allowed him to
increase profit ranges tremendously. Moreover, these shoes were sold at an
even higher price around $0.50. Pretty soon, the shoemaker was able to
save a good sum of $201.24.



Fig. 7. Example input ‘confusion.st’.



Commands have been designed for footnotes, labels, and more. However, due
to the early stage of development, no definite decision about their format has
been made. In the appendix, two example files are listed in order to provide an
example of SaferTeX code in practical applications.



7 Conclusion and Outlook 



Using simple paradigms for improving code appearance and reducing
redundancy, a language has been developed that allows more user-friendly input than
is currently possible with TeX and LaTeX. These paradigms are: the intuitive
processing of special characters, the usage of indentation for scope and the implicit
identification of environments. As an implementation of these paradigms, the
eight rules of SaferTeX were formed, which describe the fundamental structure
of the language.

While developing SaferTeX, the author quickly realized that the ability to
provide the parse tree to layout designers extends the usage beyond the domain
of TeX. Currently, much effort remains to provide appropriate commands for
document production. The functionality of popular tools such as psfrag,
fancyheaders, bibtex, makeindex, etc., is to be implemented as part of the language.
In the long run, however, it may be interesting to extend its usage towards a general
markup language.

Abstract. This paper summarizes experiences in converting METAFONT
fonts to PostScript fonts with TeXtrace and mftrace, based on programs
for autotracing bitmaps (AutoTrace and potrace), and with systems
using analytic conversion (MetaFog and MetaType1, using METAPOST
output or METAPOST itself). A development process is demonstrated
with public Indic fonts (Devanagari, Malayalam). Examples from the
Computer Modern fonts have also been included to illustrate common
problems of conversion. Features, advantages and disadvantages of
various techniques are discussed. Postprocessing (corrections, optimization
and (auto)hinting) or even preprocessing may be necessary before even
a primary contour approximation is achieved. Fully automatic
conversion of a perfect METAFONT glyph definition into perfect Type 1
outline curves is very difficult at best, perhaps impossible.

Keywords: font conversion, bitmap fonts, METAFONT, METAPOST,
outline fonts, PostScript, Type 1 fonts, approximation, Bezier curves.



1 Introduction 

In recent years, several free programs for creating PostScript outline fonts from
METAFONT sources have been developed. The aim of this paper is to give a
short comparison of these programs, with references to original sources and
documentation, and to provide a brief description of their use. We will discuss
advantages and drawbacks, and demonstrate numerous examples to compare
important features and to illustrate significant problems. We omit technical
details described in the original documentation and concentrate our attention
on the quality of the output, including hinting issues.

The programs TeXtrace and mftrace read original METAFONT sources,
generate high-resolution pk bitmaps, call autotracing programs (AutoTrace or potrace)
and finally generate the files in the Type 1 format (pfb or pfa).

MetaType1 creates Type 1 output from METAPOST sources. Therefore it
requires rewriting font definitions from METAFONT into METAPOST.

Similarly, MetaFog converts the PostScript files generated by METAPOST
to other PostScript files containing only outlines, which can subsequently be
assembled into Type 1 fonts. MetaFog is not a new product, but its excellent
results remain, in our comparisons, unsurpassed.

A. Syropoulos et al. (Eds.): TUG 2004, LNCS 3130, pp. 240-256, 2004.

Additionally, we may need adequate encoding files. If none are available,
a TeX encoding (e.g., the standard T1 encoding) is usually used as the
default.

2 Autotracing Bitmaps 

2.1 TeXtrace with AutoTrace

Peter Szabo developed TeXtrace [18]. It is a collection of Unix scripts. It reads
the original METAFONT sources, rendering the font bitmaps into PostScript (via
dvips). For converting the resulting bitmaps to outlines, it calls (in the version
of 2001) the AutoTrace program [21] created by Martin Weber, and, finally,
composes the final files in the Type 1 format. TeXtrace works fully automatically
and can be invoked by a command like this:

    bash traceall.sh mfname psname psnumber

where mfname.mf is the name of the METAFONT font, psname.pfb is the name
of the Type 1 font file, and psnumber denotes a Type 1 UniqueID [1].

The Adobe Type 1 Font Format documentation [1, pp. 29-33] recommends
observing certain Type 1 conventions: 1) points at extremes; 2) tangent
continuity; 3) conciseness; and 4) consistency.



Fig. 1. TeXtrace: the dieresis glyph in cmr10.



The outline results from TeXtrace (that is, from AutoTrace) are relatively
faithful to the original bitmaps. Some artifacts exist, but they are invisible at
usual font sizes and magnifications and may be negligible for practical purposes.
Nonetheless, they spoil our attempts to automatically produce perfect, hinted
outline fonts.

The underlying reason is that the information about the control points in
the original METAFONT is lost, and the Type 1 conventions are not satisfied,
as exemplified in Figure 1. The endpoints (double squares) are not placed at
extremes (rule 1); most of the horizontal and vertical points of extrema are
missing. Nor is the outline definition concise (rule 3): due
to the large numbers of control points in the glyph definitions, the font files
generated by TeXtrace are huge. Furthermore, the two identical periods in the
dieresis glyph are approximated by different point sets (rule 4).
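
To make rule 1 concrete: for a cubic Bezier segment, the candidate positions for “points at extremes” are the parameter values where the x or y component of the derivative vanishes, which is a quadratic equation per coordinate. The following minimal sketch (not part of any of the tools discussed; the function names are hypothetical) computes those parameters, so that a postprocessor could split the segment there.

```python
import math


def quadratic_roots(a, b, c, eps=1e-12):
    """Real roots of a*t^2 + b*t + c = 0 (degenerate cases included)."""
    if abs(a) < eps:
        return [] if abs(b) < eps else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [(-b - root) / (2 * a), (-b + root) / (2 * a)]


def extrema_params(p0, p1, p2, p3):
    """Parameters t in (0, 1) where a cubic Bezier has a horizontal or
    vertical tangent, i.e. where an extreme node would have to be inserted."""
    params = set()
    for axis in (0, 1):
        # B'(t)/3 = d0*(1-t)^2 + 2*d1*t*(1-t) + d2*t^2 for each coordinate.
        d0 = p1[axis] - p0[axis]
        d1 = p2[axis] - p1[axis]
        d2 = p3[axis] - p2[axis]
        for t in quadratic_roots(d0 - 2 * d1 + d2, 2 * (d1 - d0), d0):
            if 0 < t < 1:
                params.add(round(t, 12))
    return sorted(params)
```

Roots at t = 0 or t = 1 are ignored because the segment already has a node there; only interior extrema indicate a violation of the convention.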

The following examples show the results of conversion of Indic fonts
submitted to TUG India 2002 [16], Devanagari (dvng10) and Malayalam (mm10).

Fig. 2. Results of TeXtrace (AutoTrace): bumps and a hole (h).



Typical irregularities produced by conversion with TeXtrace are bumps and holes.
Figure 2 demonstrates bumps caused by the envelope being stroked along a path
with a rapid change of curvature, and by cases of transition from a straight line
to a particularly small arc. The second clipped part of the letter “pha” shows a
hole.

I tried to remove those bumps and holes, and (partially) other irregularities,
at the Type 1 level with a set of special programs. Places to be changed were
marked manually in a “raw” text form of the font, produced by t1disasm and
translated back by t1asm after modifications (both programs are from the t1utils
package [13]). This achieves a better outline approximation, as shown in Figure 3.

Fig. 3. Improved results achieved with postprocessing.

The postprocessing consisted of: inserting missing extrema points, changing the
first nodes of contour paths (if desirable), optimization by merging pairs (or
sequences) of Bezier segments together, and joining nodes in horizontal or
vertical straight parts to eliminate redundant nodes.
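
The last of these steps, eliminating redundant nodes in straight parts, can be illustrated by a small sketch. It is not the author's postprocessing program: it works on a plain list of (x, y) nodes of an open run of straight segments, it treats any collinear node as redundant (not only those on horizontal or vertical parts), and the name drop_collinear is hypothetical.

```python
def drop_collinear(points, tol=1e-9):
    """Remove interior nodes that lie on the straight line between the
    previously kept node and the following node."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for node, following in zip(points[1:], points[2:]):
        anchor = kept[-1]
        # Cross product of (node - anchor) and (following - anchor);
        # zero means the three points are collinear.
        cross = ((node[0] - anchor[0]) * (following[1] - anchor[1])
                 - (node[1] - anchor[1]) * (following[0] - anchor[0]))
        if abs(cross) > tol:
            kept.append(node)
    kept.append(points[-1])
    return kept
```

In a real Type 1 contour the same test would be applied only to consecutive line segments, leaving curve segments and the starting node of the path untouched.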

However, when this process was applied to the Malayalam fonts, we met
another problem: undetected corners (Figure 4). Instead of attempting to correct
them, I stopped my postprocessing attempts, and switched to experiments with
analytic methods of conversion.




Fig. 4. TeXtrace (AutoTrace) first without and then with postprocessing for the
Malayalam “a”, showing undetected corners.



Examples of CM-super. Type 1 fonts [20] generated by Vladimir Volovich
(first announced in 2001) inherit typical bugs produced by tracing bitmaps
by AutoTrace (as invoked by TeXtrace) such as bumps and holes, improper
selection of starting points of contour paths, and problems in distinguishing
sharp corners and small arcs. We illustrate them in the following figures, in
order to demonstrate that fixing such irregularities automatically is difficult.




In the period from the sfrm1000 font (its source is the original cmr10),
an optimization cannot exclude the redundant node (Figure 5) (it is still the
starting point of the path).

The minus derived from cmr10 contains a bump, and the minus from cmtt10
two bumps (Figure 6). Moreover, these bumps have been hinted and have their
own hints (probably as results of autohinting).

Fig. 7. CM-super: “M” in sfrm1000 and “i” in sfsi1000.



In the letter “M” from cmtt10, we observe missing dishes, a hole and a bad
approximation of an arc (Figure 7). In “i”, by contrast, the corners are not
detected properly, and there is also a hinted bump.

2.2 TeXtrace with potrace

The 2003 version of TeXtrace supports alternative bitmap tracing with potrace
[17], developed by Peter Selinger. In this version, the real corners are detected
or at least detected better than with AutoTrace (Figure 8). Thus, bumps and
holes have been suppressed, but smooth connections have often been changed to
sharp corners (not present originally). While the bumps demonstrated violation
of consistency and may produce invalid hinting zone coordinates (Figure 6),
the unwanted sharp corners mean loss of tangent continuity (the middle clip in
Figure 8). Unfortunately, the approximation does not preserve horizontal and
vertical directions (the right clip): the stem edges are oblique, and the difference
between the two arrows on the left edge is 2 units in the glyph coordinate space.

2.3 mftrace 

Han-Wen Nienhuys created mftrace [15,3], a Python script that calls AutoTrace
or potrace (as with TeXtrace) to convert glyph bitmap images to outlines. The
results of tracing are thus expected to be very similar to those of TeXtrace.
In fact, for the analyzed Indic fonts, they are identical, as we can see in the
first image in Figure 9 (compare with TeXtrace results in Figure 4). With the
--simplify option, mftrace calls FontForge [22] (previously named PfaEdit)
to execute postprocessing simplification; this helps to exclude redundant nodes
from outline contours, as in the second image in Figure 9.

Fig. 8. TeXtrace (using potrace), with different corners.





Fig. 9. mftrace without and with --simplify.



3 Analytic Conversions 

3.1 MetaType1

MetaType1 [8,9] is a programmable system for auditing, enhancing and
generating Type 1 fonts from METAPOST sources. MetaType1 was designed by
Boguslaw Jackowski, Janusz M. Nowacki and Piotr Strzelczyk. The MetaType1
package is available from ftp://bop.eps.gda.pl/pub/metatype1 [10].

This “auditing and enhancing” is a process of converting the Type 1 font into
MetaType1 (text) files, generating proof sheets, analysis, making corrections and
regenerating modified Type 1 fonts. It is an important tool for checking, verifying
and improving existing Type 1 fonts.

MetaType1 works with the METAPOST language. Therefore the METAFONT
font sources must be converted/rewritten into METAPOST. Macro package
extensions of METAPOST and other miscellaneous programs generate the
proper structure of the Type 1 format, evaluate hints (not only the basic
outline curves), and create pfb and also afm and pfm files.








Karel Piska 



During the rewriting process, users define several parameters of the Type 1
font, including the PostScript font encoding - PostScript glyph names and
their codes - because METAFONT sources do not contain this data in a form
directly usable for Type 1 encoding vectors. METAFONT output commands
have to be changed to their METAPOST alternatives. Similarly, it is necessary
to substitute METAFONT commands not available in METAPOST, to define
METAPOST variants of pen definitions and pen stroking, etc.

Alternative METAPOST commands are defined in the MetaType1 files
fontbase.mp, plain_ex.mp, et al. Other (new) commands may be defined by the
user. Correspondence between METAFONT and METAPOST is approximately as
shown in the following table (of course, the details may vary from font to font):



METAFONT              METAPOST

fill path;            Fill path;
draw path;            pen_stroke()(path)(glyph);
                      Fill glyph;
penlabels(1,2);       justlabels(1,2);
beginchar(...         beginglyph(...
endchar;              endglyph;





Fig. 10. MetaType1 - primary outlines and overlap removal.



Many METAFONT commands have no counterpart in METAPOST [6]. One
example is operations with bitmap pictures: in METAPOST, font data is
represented as PostScript curves, not bitmaps. As a result, writing METAPOST code
that would produce results equivalent to original METAFONT code using these
or other such features would be very difficult.

After the basic conversion, the next step is removing overlaps (if any are
present) using the MetaType1 command find_outlines. Figure 10 shows the
results before and after overlap removal for the Malayalam vowel a (font mm10
using pen stroking with a circular pen). This operation is not necessary in
METAFONT, since it generates bitmaps. In the METAPOST environment of PostScript
outlines, however, we need to reduce overlapping curves to single paths or pairs of
paths.

MetaType1 also allows insertion of commands for automatic computation of
horizontal and vertical hints (FixHStems, FixVStems). The Type 1 font can be
visualized in a proof sheet form containing the control point labels (numbers)
and hinting zones (Figure 11).

So far, so good. But there are two crucial problems. First, the METAFONT
Malayalam fonts designed by Jeroen Hellingman [5] use the command

    currenttransform := currenttransform shifted (.5rm, 0);

So all the glyphs should be shifted to the right. METAFONT saves the
transformation command and does this operation automatically. By contrast, in METAPOST
we need to insert the shift commands explicitly in all glyph programs. The
labels must be shifted as well! In my experiments, I did this shift operation later, before
final assembly of the Type 1 fonts.

The second problem is that in MetaType1 (I used MetaType1 version 0.40
of 2003) a regular pen stroking algorithm is not available, only a simplified
method of connecting the points ‘parallel’ to the nodes on the path. Therefore
the approximation of the envelope is not correct. For example, in Figure 12 it
should be asymmetric, but it is symmetric. Inserting additional nodes cannot
help, because the bisection results will again be asymmetric. The figure shows
that the outline curves do not correspond to the real pen in two midpoint locations.
The envelope there looks narrow and it is in fact narrower than it should be.
I hope that this problem can be solved in a future release, at least for pen
stroking with a circular pen.

Even more serious is the situation with the rotated elliptic pen used in the
Devanagari fonts designed by Frans Velthuis [19] (and also other Indic fonts
derived from dvng). The absence of regular pen stroking in MetaType1 makes
it impractical for such complicated fonts. MetaType1 approximates the pen
statically in path nodes, tries to connect their static end points, and ignores
complicated dynamic correlations between the path, the pen and the envelope.
Unfortunately, in this case the results of the envelope approximation are not
correct and cannot be used (Figure 13).




Fig. 11. MetaType1 - proof sheet.

Fig. 13. MetaType1 - Devanagari “i”, “a”.



3.2 MetaFog 

Two programs using analytic conversion were presented in 1995. Basil K.
Malyshev created his BaKoMa collection [14] and Richard J. Kinch developed
MetaFog [11]. BaKoMa is a PostScript and TrueType version of the Computer
Modern fonts. Malyshev’s paper discusses some problems of conversion,
especially regarding hinting, but his programs and detailed information about the
conversion algorithm are not available.

R. Kinch created MetaFog along with weeder, which supports interactive
processing of outlines, and a package for making final fonts from outlines
generated by MetaFog in TrueType, Type 1 and other formats. MetaFog itself (I
used an evaluation version graciously donated by Richard) reads the METAPOST
output from the command:

    mpost '&mfplain options;' input fontname.mf

Thus, the conversion (from METAFONT sources) is limited to fonts that can be
processed by METAPOST, that is, fonts that do not contain METAFONT-specific definitions
and commands. MetaFog generates another PostScript file consisting only of the
outline structures. The conversion process is also described in the paper written
by Taco Hoekwater [7].

MetaFog evaluates outline contours and precisely computes envelopes of an
elliptical pen stroking along a Bezier curve.

Fig. 14. MetaFog - initial input contour and final result.

Note that the envelopes in general are not cubic Bezier curves and their
representation in a Type 1 font must be an approximation. The results for a
circular pen, on the other hand, can be considered perfect. Figures 14 and 15
show an example of the Malayalam letter “a” (font mm10): the initial and final
contours and the final Type 1 font with control points (stroked version) and its
visual comparison with METAPOST output embedded in a Type 3 font, respectively.






Fig. 15. MetaFog - final Type 1 font. 





Problems with Complex Pen-Stroking. A more complicated situation is
the conversion of fonts using pen stroking with a rotated elliptical pen, such as
the Devanagari font. Figure 16 illustrates this case. The initial input contour
and final result contour (tta1) look good - in the first image we can see the
projections of the pen in nodes corresponding to the METAFONT source. But exact
comparison with the original METAPOST output embedded in a Type 3 font
(tta2) and primary MetaFog conversion displayed together with the METAPOST
source (tta3) shows that this approximation is not correct. Because these
elements are very common in shapes of all but the simplest Devanagari glyphs,
corrections are necessary.

I therefore applied a simple pen-dependent preprocessing step before the
MetaFog conversion, adapting the METAPOST output by a modified form
of bisection, as discussed in a paper by R. Kinch [11]. The preprocessing scans
curves, searching for points where the path direction and the direction of the
main axis of the pen coincide (namely 135°), and inserts these points as
additional path nodes. In our case, the transformation matrix is
cos θ · [1, 1, −1, 1], so we solve only a quadratic equation and can find
0, 1 or 2 (at most) of these points.
This technique corrects the MetaFog approximation of all such occurrences in
the dvng font. The result of this secondary MetaFog conversion with METAPOST
source is shown in the last panel of Figure 16 (tta4).

Fig. 16. MetaFog contours, METAPOST output, primary and secondary conversion on
the METAPOST background.
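The node-insertion idea can be illustrated with a minimal sketch. It is not the preprocessing program used for the dvng font: the function names `tangent_parameters` and `split_cubic` are hypothetical, and the sketch assumes a single cubic Bézier segment given by its four control points. Since the derivative of a cubic is quadratic in t, requiring the tangent to be parallel to the 135° pen axis yields a quadratic equation with at most two roots in (0, 1); each root can then be turned into a new node by de Casteljau subdivision.

```python
import math


def tangent_parameters(p0, p1, p2, p3, angle_deg=135.0, eps=1e-12):
    """Return sorted t in (0, 1) where the cubic's tangent is parallel to angle_deg."""
    phi = math.radians(angle_deg)
    # Normal to the pen-axis direction; the tangent is parallel to the axis
    # exactly when its projection on this normal vanishes.
    nx, ny = math.sin(phi), -math.cos(phi)

    def proj(q, r):
        return (r[0] - q[0]) * nx + (r[1] - q[1]) * ny

    a, b, c = proj(p0, p1), proj(p1, p2), proj(p2, p3)
    # B'(t) . n / 3 = qa*t^2 + qb*t + qc
    qa, qb, qc = a - 2 * b + c, 2 * (b - a), a
    if abs(qa) < eps:                      # degenerate case: linear equation
        roots = [] if abs(qb) < eps else [-qc / qb]
    else:
        disc = qb * qb - 4 * qa * qc
        if disc < 0:
            roots = []
        else:
            s = math.sqrt(disc)
            roots = [(-qb - s) / (2 * qa), (-qb + s) / (2 * qa)]
    return sorted({t for t in roots if eps < t < 1 - eps})


def split_cubic(p0, p1, p2, p3, t):
    """Split a cubic Bezier at parameter t (de Casteljau); returns two cubics."""
    def lerp(q, r):
        return (q[0] + (r[0] - q[0]) * t, q[1] + (r[1] - q[1]) * t)

    q0, q1, q2 = lerp(p0, p1), lerp(p1, p2), lerp(p2, p3)
    r0, r1 = lerp(q0, q1), lerp(q1, q2)
    s = lerp(r0, r1)                       # the new on-curve node
    return (p0, q0, r0, s), (s, r1, q2, p3)
```

For a quarter-circle-like arc from (1, 0) to (0, 1), the only 135° tangent lies at the midpoint of the arc, so one node would be inserted there; an S-shaped segment can yield two such nodes, and a horizontal straight segment none.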

Similar improvements for the Devanagari letters “a” and “pha” are shown
in Figure 17. For “pha”, the first 135° node was already present in the path
defined by the METAFONT source (first panel, pha1); the second
occurrence of a 135° point, by contrast, was absent, and therefore it was
inserted in the METAPOST output (last panel, pha2).

Of course, this improvement is not universal: it only solves a special problem
with a special pen for a special font.

Figure 18 illustrates movement of a rotated elliptical pen stepping along
a “nice” path (panel 1). However, correlations with the pen are not trivial:
changes of curvature of the outer wingtip curve do not have simple monotonic
behavior, and the inner wingtip curve (panel 2) is even more complicated. This
means that the pen-stroked wingtip curves along a single Bézier curve cannot
be approximated by single Bézier curves (compare with the starting Figure 16,
panel tta1), i.e., an envelope edge of a pen along a simple path is not simple.

Fig. 17. MetaFog output before and after modification of METAPOST source.

Automatic Conversion Problems. A “dark side” of improving the curve 
approximation is a fragmentation of an envelope curve into many segments (often 
more than 10, and up to 16 in Devanagari!). We achieve a faithful approximation 
(limited only by numerical accuracy) at the expense of conciseness. To make up 
for this, postprocessing is needed. The original MetaFog output and a result 
of my (preliminary) optimization assembled into Type 1 fonts are shown in 
Figure 19. 

Unfortunately, even a small computational inaccuracy can make automatic
conversion and optimization impossible, and can even make it very difficult to
design postprocessing algorithms. In Figure 20, we demonstrate problems with the
primary approximation of an envelope stroked by a rotated elliptical pen, and
also difficulties with automatic optimization of the Devanagari ligature “d+g+r”.

Fig. 18. Wingtip curves in METAPOST source.

In the first panel of Figure 20, we observe an artifact produced by MetaFog 
due to a complicated correlation of the pen and the path. Fortunately, those 
cases are very rare (less than 1% of glyphs in Devanagari).

In the second panel, the path and subsequently the corresponding envelope 
edges are not absolutely horizontal, thus (probably) MetaFog cannot properly 
find intersection points and join reconstructed outline components. Those defects 
are present in more than 12% of the Devanagari glyphs. In all cases, they have 
been successfully solved manually by the interactive weeder program. 

In the last two details in Figure 20 (the lower ending part of the left stem) we
can see that both nodes of the left segment are outside the filled area boundary
defined by the METAPOST curve. The outer wingtip edge is split there into many
segments, some being straight lines, which they should not be: for example, the
first and the third segment, marked by two arrows in the clip, have curvatures
that are for us undefined. Additionally, we cannot detect the last segment
(magnified in the figure) as horizontal because its angle is “greater than
some ε”.

Fig. 20. MetaFog - problems with automatic conversion.

Thus, neither node coordinates, nor segment directions, nor curvatures are
reliable. The figure gives a visual comparison of the METAPOST output with its
outline approximation. Therefore, my first (and “simple”) idea cannot succeed.
This was to classify the behavior of directions and curvatures of all the
segments automatically, to divide the segments into groups according to
directions and curvatures, and then to merge the groups automatically into
single Bézier segments.
As demonstrated, this optimization may fail or produce incorrect results and,
unfortunately, human assistance is needed.

4 Summary 

Here we summarize the most important features of the conversion programs 
found in our experiments. 

4.1 Approximate Conversions: TeXtrace, mftrace

Advantages: 

— approximation covers original METAFONT fonts and correspondence to pk
bitmaps is (reasonably) close

— simple invocation, robust solution 

— fully automatic processing can generate complete, final Type 1 fonts 
Disadvantages: 

— approximate conversions give only approximate outlines 

— lost information about nodes and other control points

— final fonts do not satisfy the Type 1 conventions

— AutoTrace: problems with recognizing corners, generation of unwanted 
bumps and holes 

— potrace: sharp connections, thus loss of tangent continuity, violation of 
horizontal or vertical directions 

— automatic and correct (auto)hinting may yield poor results due to these 
irregularities 



4.2 MetaType1

Advantages: 

— complete support for Type 1 font generation 

— manual insertion of hinting information possible via simple hinting com- 
mands 

— font file compression via subroutines 
Disadvantages: 

— conversion of METAFONT to METAPOST often requires manual rewriting,
possibly non-trivial and time-consuming

— bad pen stroking algorithm; in particular, results for complicated fonts using 
rotated elliptical pens are unusable 

— difficulties with removing overlaps in tangential cases 



4.3 MetaFog 

Advantages: 

— fully automatic conversion of METAPOST output to outlines

— “typical” fonts usually achieve perfect results

— even for very complex fonts (again, with rotated elliptical pens), adaptations
of METAPOST output and manual editing with weeder make it possible to
obtain perfect outlines

— results fulfill the Type 1 conventions in most cases (except for those very 
complex fonts) 

Disadvantages: 

— MetaFog reads METAPOST output, thus cannot process METAFONT-specific
definitions

— complex fonts may still need manual reduction with weeder or subsequent 
optimization of outlines to reach conciseness 

— processing is slow

4.4 Final Font Processing and Common Problems

The conversion systems discussed here, with the exception of MetaType1, do
not include internal hinting subsystems. To insert hints, we can use font editors, 
for example FontForge [22]. For successful automatic hinting, however, the font 
outlines must fulfill certain conditions. Irregularities - absence of nodes at 
extrema or presence of bumps and holes - are not compatible with autohinting, 
because extrema points correspond to hinting zones while bumps or holes do 
not fit them, thus causing outliers. The resulting difference of ±1 unit in the 
integer glyph coordinate system, after rounding to integers, is not acceptable for 
high-quality fonts. Problems may also be caused by other “rounding to integer” 
effects, and by the presence of close doublets or triplets. 
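The condition “nodes at extrema” can be checked mechanically. The following is a minimal sketch under stated assumptions, not part of any of the tools discussed: the function name `interior_extrema` is hypothetical, and each outline segment is assumed to be a cubic Bézier given by four control points. A segment whose x or y coordinate reaches an extremum strictly inside the segment is missing a node at that extremum and is therefore a candidate for trouble with autohinting.

```python
import math


def interior_extrema(p0, p1, p2, p3, eps=1e-9):
    """Return (axis, t) pairs where x or y has an extremum strictly inside (0, 1).

    An empty result means the segment already has its nodes at extrema.
    """
    found = []
    for index, axis in enumerate(("x", "y")):
        a = p1[index] - p0[index]
        b = p2[index] - p1[index]
        c = p3[index] - p2[index]
        # Derivative of the coordinate, divided by 3: qa*t^2 + qb*t + qc
        qa, qb, qc = a - 2 * b + c, 2 * (b - a), a
        if abs(qa) < eps:
            roots = [] if abs(qb) < eps else [-qc / qb]
        else:
            disc = qb * qb - 4 * qa * qc
            # disc == 0 is a stationary point without sign change, not an extremum
            if disc <= 0:
                roots = []
            else:
                s = math.sqrt(disc)
                roots = [(-qb - s) / (2 * qa), (-qb + s) / (2 * qa)]
        found.extend((axis, t) for t in sorted(roots) if eps < t < 1 - eps)
    return found
```

An arch drawn as one segment, such as (0, 0), (0, 100), (100, 100), (100, 0), reports a y extremum at t = 0.5, whereas a quarter arc that starts and ends at its extrema reports nothing.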

In my view, these experiments show that the quality of primary outline 
approximation is crucial to achieve perfect final Type 1 fonts. It is virtually 
impossible to recreate discarded METAFONT information, or to find exact
conditions for a secondary fit that corrects primary contours that were created with
irregularities or artifacts. Starting with high-resolution bitmaps is problematic, 
as too much information has been lost, making subsequent processes of improve- 
ment, optimization and hinting difficult at best, not possible to automate and 
usually not successful.

Abstract. Two hundred years ago a font was a collection of small pieces
of metal. Using that font required the services of a skilled typesetter to
handle the niceties of kerning and ligatures. Modern fonts are expected
to encapsulate both the letter shapes found on the pieces of metal, and
the intelligence of the typesetter by providing information on how to
position and replace glyphs as appropriate. As our view of typography
extends beyond the familiar Latin, Greek and Cyrillic scripts into the
more complex Arabic and Indic we need greater expressive power in the
font itself. As of this writing there are two fairly common methods to
include this metainformation within a font: that used by Apple (GX
technology) and that used by Microsoft and Adobe (OpenType). I shall
compare these two formats and describe how FontForge, a modern open
source font editor, may be used to implement either or both.



1 Introduction 

Modern fonts are more than just collections of glyph shapes; they must also
contain information on how those glyphs are put together. In West-European
typography the two most obvious examples of this are kerning and ligatures.

In Arabic most characters should have at least four variants, depending on
what other characters surround them, and there are also a vast number of
ligatures and marks.

Indic scripts require glyph rearrangement, and a complex system of glyph
replacements.

Apple has developed one way to describe this metainformation as part of its
GX font technology, and Microsoft and Adobe have developed another
mechanism as part of OpenType. Although both of these systems have the same
ultimate goals, the philosophy behind them, the mechanisms used, and the
expressive powers of the formats are markedly different.

2 Comparing GX and OpenType 

On one level the difference between these two technologies is similar to the
different approaches toward hinting used by PostScript and TrueType fonts. In
the approach used by PostScript and OpenType the font contains information
describing what needs to be accomplished, while TrueType and GX
provide little programs that actually accomplish it.

GX puts a greater burden on the font designer. Writing a state machine that 
converts the glyph sequence “f” “i” to the “fi” glyph is harder than just stating 
that “f” “i” should become “fi.”

Looking at a GX state machine and attempting to figure out what glyphs
are converted into what ligatures in what situations is, in general, impossible
(equivalent to the halting problem). In an OpenType font, by contrast, this
ligature composition is exactly the information provided in the font.

Both technologies allow a font to attach a high level description (called a 
feature or feature setting) to a set of glyph transformations. Some features will 
be turned on automatically if appropriate conditions are met, others are designed 
to be invoked by the user when s/he deems it appropriate. 



3 Comparing GX and OpenType Transformations 

The expressive powers of the two formats for non-contextual transformations 
(simple kerning, substitutions, ligatures) are very similar, so similar that Font- 
Forge can use the same interface for both. 

In the area of contextual transformations the two formats differ wildly. Open- 
Type provides a series of fixed length patterns. At every glyph position in the 
glyph stream each pattern is tried in sequence. If one matches at that position 
then any transformations specified are applied and the process moves on to the 
next glyph position. 

GX provides a state machine. The state machine processes the entire glyph 
stream. There may be several places within the stream where a match is found, 
and at those places a maximum of two transformations may be applied. 

Neither format is a sub-set of the other. Both can express things the other 
cannot. 

A GX state machine can easily match the regular expression “ab*c” and 
then replace “a” with “d” . In OpenType this would require an infinite number 
of patterns. GX provides a mechanism for determining if a glyph is at the start 
or end of a text line (so swash substitutions could be made dependent on this) 
while OpenType does not. 
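The “ab*c” case can be sketched as a small table-driven state machine. This is only an illustration of the idea, not the GX binary format: the state names, the `TABLE` layout and the function `apply_ab_star_c` are assumptions made for this example, and glyphs are represented by plain strings. The machine marks each “a” it sees, loops on any number of “b” glyphs, and replaces the marked “a” with “d” when the closing “c” arrives.

```python
START, SEEN_A = 0, 1


def glyph_class(glyph):
    """Map a glyph name to one of the four classes used by this sketch."""
    return glyph if glyph in ("a", "b", "c") else "other"


# TABLE[state][class] = (next_state, mark_current_glyph, marked_glyph_substitution)
TABLE = {
    START: {
        "a": (SEEN_A, True, None),
        "b": (START, False, None),
        "c": (START, False, None),
        "other": (START, False, None),
    },
    SEEN_A: {
        "a": (SEEN_A, True, None),          # a new candidate start
        "b": (SEEN_A, False, None),         # any number of "b" glyphs
        "c": (START, False, {"a": "d"}),    # match complete: rewrite the marked "a"
        "other": (START, False, None),
    },
}


def apply_ab_star_c(glyphs):
    """Replace "a" with "d" wherever the stream contains a, then b*, then c."""
    out = list(glyphs)
    state, marked = START, None
    for i, glyph in enumerate(out):
        next_state, mark, marked_subst = TABLE[state][glyph_class(glyph)]
        if marked_subst is not None and marked is not None:
            out[marked] = marked_subst.get(out[marked], out[marked])
        if mark:
            marked = i
        state = next_state
    return out
```

With fixed-length patterns the same effect would need one pattern per run length (“ac”, “abc”, “abbc”, and so on), which is why the text says that an infinite number of patterns would be required.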

On the other hand replacing the string “abc” with “def” can be done easily in 
OpenType but not at all in GX (three substitutions need to be applied after the 
match is found, and GX only supports two). OpenType allows different types of 
substitutions to be applied within a pattern (for example a ligature at one point 
and a simple substitution at another) while GX does not. OpenType matches a 
pattern against every glyph position in the stream, and each pattern may have 
multiple substitutions. This means it is possible for several substitutions to be 
applied to one position. This cannot be expressed with GX. 

FontForge uses the same user interface (UI) for specifying
non-contextual substitutions in both formats but must use different UIs for
contextual ones.

4 Non-contextual Transformations in FontForge

Many transformations affect one glyph at a time. In the Element -> Char Info 
dialog FontForge shows all the transformations that may affect the current glyph. 



[Screenshot: the FontForge Char Info dialog with the substitution entry
“sups onesuperior”, and the Edit Substitution dialog showing Components
“onesuperior”, Tag “sups”, Script & Languages “latn{dflt}”, and the flags
Right To Left, Ignore Base Glyphs, Ignore Ligatures, Ignore Combining Marks.]



Fig. 1. Simple substitutions of “one” 



In this case the glyph named “one” can be replaced by the glyph
“onesuperior” under the control of the ‘sups’ feature (presumably the ‘sups’
feature would transform “two” to “twosuperior,” but that will be shown in the
dialog for “two,” not here).

Double-clicking on an entry in the list (or pressing the [New] button) will 
invoke another dialog showing information about the current substitution. 

Every transformation must have a tag, script, and flag set associated with
it. The tag is either a four-letter OpenType feature tag or a two-number GX
feature/setting; either format may be used to identify the transformation. Some
OpenType features correspond directly to a GX feature setting (OpenType ‘sups’
matches GX 10,1). In this case FontForge will use the OpenType feature tag to
describe the transformation (as the more mnemonic of the two) and will translate
it into the appropriate GX tag when an Apple font is generated.

The script and language information is only meaningful in OpenType fonts. 
It specifies the conditions under which the current transformation is to be ap- 
plied. Generally, the script is obvious from the current glyph, and the language 
information is irrelevant - the transformation should always be applied in that 
script and the language is set to “default.” FontForge can generally guess the 
appropriate script by looking at the current glyph. But some glyphs, like the 
digits, are used in multiple scripts (e.g., Latin-based scripts, the Greek script, 
and Cyrillic script among others share those glyphs), and in this case the user
may need to adjust the script pulldown.

When outputting a GX table FontForge will ignore any transformations
which do not have a “default” language in their script/language list.

The right to left flag is meaningful in both GX and OpenType. The others 
are only meaningful in OpenType. FontForge is usually able to determine the 
proper setting of right to left by looking at the glyph itself, but you can always 
correct it if necessary. 



5 Contextual Transformations for OpenType 

Contextual transformations have two different aspects. One is a set of simpler
transformations that actually do the work, and the other provides the context in
which they are active. Let us suppose the user has a script font where most letter
combinations join at the baseline, but after a few (“b”, “o”, “v” and “w”) the join
is at the x-height. After these letters we need a different variant of the following
letter. So in OpenType the user must specify two different transformations: the
first replaces the normal form of a letter with the variant that looks right after
a “b”, and the second specifies the context in which the first is active.

The transformations which do all the work are created glyph by glyph much 
like any other simple substitution described above. The only difference is that 
instead of specifying a script/language combination you should use the special 
entry in the script menu called Nested. This tells FontForge that the
substitution will only be used under the control of a contextual transformation.

Contextual transformation features are complex to specify. These apply to
more than one glyph at a time and are reachable through Element -> Font
Info. There are essentially two different formats for these: contextual and
chaining contextual. Chaining contextual is a superset of contextual and allows
you to use characters previously processed (earlier in the glyph stream) as
part of your context.





Fig. 2. OpenType Contextual formats

Such a feature requires a tag, script, and set of flags specified in a dialog very
similar to that used for simple features.

The context in these transformations may be specified as patterns of glyphs, 
patterns of classes of glyphs, or patterns of a choice of glyphs. 

The glyph stream is divided into three parts: glyphs before the current glyph,
which may not be changed but which are matched for context; a few glyphs
starting at the current glyph, which may be changed and are matched; and glyphs
ahead of the current glyph, which are only matched for context.




Fig. 3. OpenType Contextual by classes 



The glyphs provided by the font are divided into different classes. There 
may be a different set of glyph classes for each region of the glyph stream. In 
the following example, there are three classes defined for the region around the 
current glyph. By convention class 0 contains all glyphs which are not explicitly 
allocated to another class. Class 1 contains the glyphs “b o v w b.high...” Class 2 
contains the glyphs “a c d e...”. The buttons underneath the class display allow 
the user to modify these classes. 

At the top of the screen are the patterns which will be matched. Here there 
is a single pattern, which says: “The character before the current glyph must be 
in class 1, and the current glyph must be in class 2. If this pattern is matched 
then the glyph which is at position 0 (the current glyph) should be altered by 
applying the simple substitution defined by feature tag ‘high’.” 
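The single class-based pattern described above can be sketched in a few lines. This is a simplified model, not the OpenType table layout and not FontForge code: the names `CLASSES`, `HIGH`, `PATTERNS` and `apply_contextual` are hypothetical, the class lists contain only the glyphs named in the text (the originals are truncated), and the “.high” variant names for class 2 glyphs are an assumption modelled on “b.high”.

```python
# Glyph classes for the region around the current glyph (lists truncated as in the text).
CLASSES = {
    1: {"b", "o", "v", "w", "b.high"},
    2: {"a", "c", "d", "e"},
}

# Nested simple substitution selected by the feature tag 'high' (variant names assumed).
HIGH = {"a": "a.high", "c": "c.high", "d": "d.high", "e": "e.high"}

# One pattern: the glyph before must be in class 1, the current glyph in class 2;
# on a match, apply HIGH at position 0 (the current glyph).
PATTERNS = [{"backtrack": [1], "input": [2], "apply": [(0, HIGH)]}]


def class_of(glyph):
    for number, members in CLASSES.items():
        if glyph in members:
            return number
    return 0  # class 0: every glyph not explicitly allocated to another class


def apply_contextual(glyphs):
    """Try each pattern at every glyph position; the first match wins."""
    out = list(glyphs)
    for i in range(len(out)):
        for pattern in PATTERNS:
            nb, ni = len(pattern["backtrack"]), len(pattern["input"])
            if i - nb < 0 or i + ni > len(out):
                continue
            before = [class_of(g) for g in out[i - nb:i]]
            here = [class_of(g) for g in out[i:i + ni]]
            if before == pattern["backtrack"] and here == pattern["input"]:
                for pos, subst in pattern["apply"]:
                    out[i + pos] = subst.get(out[i + pos], out[i + pos])
                break
    return out
```

For the glyph stream “b”, “a”, “d” only the “a” changes, because after the substitution the glyph before “d” is “a.high”, which falls into class 0 in this truncated class list.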

6 Contextual Transformations for GX 

As with OpenType there are two aspects to a GX contextual transformation. 
There is a set of simple non-contextual transformations which do the work, and 
a state machine which controls when a simple transformation is activated. 

The glyphs of the font are divided into a set of classes (each state machine
defines its own class sets). GX provides four predefined classes; some of these
classes do not represent actual glyphs but concepts like “end of line”. The state
machine looks like a two-dimensional array of transitions. On one axis we have the
glyph classes, and on the other we have the states. GX provides two predefined
states, start of text and start of line. Usually these two states are identical but
the distinction is present if the user wishes to take advantage of it.

The transitions are slightly different depending on the type of state machine. 
Here I shall only discuss contextual substitution transitions. If the state machine 
is in a given state, and the current glyph is in a given class, then the transition 
specified for that class in the given state will be applied to figure out what to 
do next. A transition specifies the next state (which may be the same as the 
current state), two flags, and two substitutions. One flag controls whether to 
read another glyph from the glyph stream, or continue with the same current 
glyph. The other flag allows you to mark the current glyph for future reference. 
Two substitutions may be specified: one applies to the current glyph, and the
other applies to any previously marked glyph. Either substitution may be empty.




Fig. 4. GX state machine types 



Again these state machines apply to more than one glyph at a time and are 
specified by Element -> Font Info. 

In a few cases, FontForge will be able to convert an OpenType contextual 
substitution to a GX state machine. The [Convert From OpenType] button at 
the bottom of the dialog will show a list of OpenType features FontForge thinks 
it can convert. You can also create your own (or edit an old) state machine from 
here. 

Every state machine must be associated with a GX feature/setting, and the 
context in which it executes may be controlled with the vertical and right to left 
check boxes. 

This state machine executes the same script example discussed earlier. Again
the glyph set is divided into two interesting classes: those glyphs that are followed
by a high join, and those glyphs that are not. The state machine has two real
states: in one the current glyph needs to be converted to a high variant, in the
other the current glyph remains the low variant. State 2 maps the current glyph
to its high variant, while both states 0 & 1 retain the current glyph unchanged
(this example makes no distinction between the first two states, but the format
requires that both be present).

Fig. 5. GX state machine

If we are in state 0 and we get a normal glyph (class 5) then we read the 
next glyph, stay in state 0 and nothing much happens. If we get a glyph that 
needs a high connection after it then we read the next glyph, change to state 2 
and nothing else happens. 

If we are in state 2 and we get any letter glyph then we apply the ‘high’
substitution to convert that glyph to its high variant. If it is a normal glyph we
drop back to state 0, but if it needs a high connection after it we stay in state 2.
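The transitions just described can be written out as a small table. The sketch below is a simplified model of this one example rather than the GX table format: the names `TABLE`, `classify` and `run` are hypothetical, class 5 for normal letters follows the text, class 4 for letters followed by a high join is an assumption, non-letter glyphs are lumped into a single `OTHER` class, and the “.high” naming of variants is assumed. Every transition here reads the next glyph, and the mark flag is not needed.

```python
HIGH_AFTER, NORMAL, OTHER = 4, 5, "other"   # class 4 is an assumed number
HIGH_JOIN_LETTERS = {"b", "o", "v", "w"}


def classify(glyph):
    if glyph in HIGH_JOIN_LETTERS:
        return HIGH_AFTER
    if len(glyph) == 1 and glyph.isalpha():
        return NORMAL
    return OTHER


# TABLE[state][class] = (next_state, apply_high_to_current_glyph)
LOW_ROW = {NORMAL: (0, False), HIGH_AFTER: (2, False), OTHER: (0, False)}
TABLE = {
    0: LOW_ROW,   # start of text
    1: LOW_ROW,   # start of line: identical to state 0 in this example
    2: {NORMAL: (0, True), HIGH_AFTER: (2, True), OTHER: (0, False)},
}


def run(glyphs, start_state=0):
    """Run the machine over the glyph stream and return the substituted stream."""
    out = []
    state = start_state
    for glyph in glyphs:
        next_state, substitute = TABLE[state][classify(glyph)]
        out.append(glyph + ".high" if substitute else glyph)
        state = next_state
    return out
```

Running this on the letters of “boat” leaves “b” unchanged, converts “o” and “a” to their high variants, and leaves “t” unchanged, since “a” drops the machine back to state 0.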

If the user wishes to change a transition s/he may click on it and a new dialog 
pops up giving control over the transition. The arrow buttons on the bottom of 
the dialog allow the user to change which transition s/he is editing. 

7 Summary 

Both OpenType and GX attempt to provide similar effects. For simple concepts 
(non-contextual substitutions like “superscript” ) FontForge uses the same format 
internally to express both. This means the user only needs to specify these data 
once rather than do so for both OpenType and GX. But the expressive powers 
of the more complicated concepts (contextual substitutions for example) are 
so different that FontForge requires users to specify this information in both 
formats.